[Toybox] [PATCH] grep: add --line-buffered and fix regular buffering.

Rob Landley rob at landley.net
Tue Apr 16 21:50:29 PDT 2019


On 4/16/19 7:42 PM, enh wrote:
>> Line buffered is the behavior I've been assuming, but I see glibc defaults to
>> "unbuffered"... in which case what is FILE * _for_ again? A strrchr() on each
>> buffer with a flush to that point isn't that expensive, and most of the stuff
>> that cares about micromanaging buffers is already using filehandles instead of
>> FILE * anyway...
> 
> i think the problem here is that you're looking at this through the
> wrong lens. i think your mental model is "interactive user", whereas
> i'm thinking of them *and* of scripts/builds.

I'm thinking of both, but didn't see how scripts should be significantly
penalized here?

> the traditional behavior of trusting the C library's buffering and
> having just one exception (that i've ever seen) in grep _is_ sane.
>
> i don't know where your misapprehensions about libc's behavior are
> coming from...

When I first met Denys Vlasenko in person (the guy I handed busybox
maintainership off to), it was because he attended a US conference to talk
about how he'd fixed a glibc bug that caused statically linked binaries built
with --gc-sections not to flush stdout on exit.

  https://elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
  https://bootlin.com/pub/video/2010/elc/elc2010-vlasenko-dead-code.ogv

(I hit it long before then, but he root-caused it. It was of course an issue
where the glibc developers and the gcc developers were each pointing at the
other and telling them to change, and as a result the issue didn't get fixed
for something like 5 years. I think that was back when Ulrich "static
linking considered harmful" Drepper was still maintainer,
https://www.akkadia.org/drepper/no_static_linking.html. He was replaced by a
committee that broke up in 2012, https://lwn.net/Articles/488847/, but it's
still a gnu/dammit project and that's bad.)

The "tee" command broke so it only passed through a 4k block of data at a time
(rather than unbuffered output) _three_times_ over the years I used it. As in
broke got fixed and broke again, 3 times, all on Linux. And that's not counting
busybox breaking it and me fixing it there:

  http://lists.busybox.net/pipermail/busybox/2004-May/045755.html

And yes, the fix was to replace fread(), which was buffering 4k at a time,
with read().
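
The shape of that fix looks something like this (a minimal sketch of a
read()-based copy loop, not the actual busybox commit; the function names
here are mine):

  #include <unistd.h>

  // Write all of buf, retrying on short writes.
  static void writeall(int fd, char *buf, size_t len)
  {
    ssize_t out;

    while (len) {
      if ((out = write(fd, buf, len)) < 0) return; // real code reports errors
      buf += out;
      len -= out;
    }
  }

  // Copy stdin to both stdout and an extra fd, passing along however much
  // data each read() returns instead of letting fread() collect 4k first.
  static void tee_loop(int extrafd)
  {
    char buf[4096];
    ssize_t len;

    while ((len = read(0, buf, sizeof(buf))) > 0) {
      writeall(1, buf, len);
      writeall(extrafd, buf, len);
    }
  }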

Here's a quick grep of my blog for _just_ the fights with the "tee" command
over the years, not all of which were glibc's fault of course:

https://landley.net/notes-2005.html#23-08-2005
https://landley.net/notes-2006.html#06-03-2007
https://landley.net/notes-2008.html#07-10-2008
https://landley.net/notes-2008.html#03-07-2008
https://landley.net/notes-2011.html#24-08-2011

The deepest dive to fix a tee issue slightly predates the current iteration of
my blog:

https://marc.info/?l=user-mode-linux-devel&m=111056135413631&w=2

And the most _recent_ time glibc broke this was 2017:

  https://landley.net/notes-2017.html#08-04-2017

And yes, it was "stdout winds up fully buffered for some unknown reason, same
build against musl did not, fix it with gratuitous flushes"...

> they're not true of bionic, nor any version of glibc
> i've ever seen.

How about if I explicitly ask for _unbuffered_ in the common code (which
according to you should be a NOP) and then implement --line-buffered for grep?
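
Mechanically that's just a setvbuf() before the first output, something like
(a sketch, the helper name is hypothetical, this isn't actual toybox code):

  #include <stdio.h>

  // Called from shared startup code before anything writes to stdout
  // (setvbuf() has to precede other operations on the stream).
  // _IONBF = unbuffered (the proposed common-code default), _IOLBF = what
  // grep --line-buffered would select, _IOFBF = full buffering.
  void init_stdout_buffering(int mode)
  {
    setvbuf(stdout, NULL, mode, 0);
  }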

> POSIX explicitly says:
> 
>   When opened, the standard error stream is not fully buffered; the
> standard input and standard output streams are fully buffered if and
> only if the stream can be determined not to refer to an interactive
> device.

So they don't specify whether it's line buffered or unbuffered, just not _fully_
buffered. And if stdout is fully buffered when you pipe it to tee you get the 4k
blocking problem again. So the bug kept showing up because the standard said the
bug should show up, great.

Except... the fix in the busybox case was on the _read_ side? It's fully
buffering INPUT by default too? That's INSANE. If 4k of data is waiting when you
read, you get/process 4k. If less is waiting, PROCESS WHAT YOU GOT...

> that's the only behavior i've ever seen implemented, and that's the
> behavior you want. scripts/builds don't get pessimized performance,
> and interactive users don't have to wait for 1KiB/4KiB/8KiB.
> 
> (of course, this assumes that the application isn't flushing way too
> often, which i claim is the real and only bug here.)

Which I just removed. I agree, it was flushing way too often, because I was
being more paranoid than necessary about error checking.

But "stdout is unbuffered" and "we flush(stdout) after write" is the
_same_thing_ isn't it?

If it's fully buffered, then "grep | tee" doesn't show you lines as it finds
them. (Not with all those extra flushes removed.) If it's unbuffered, we might
as well have left xprintf() flushing each write.

Fully buffered pipeline output means "watch" will never progressively display
anything unless it fakes a pty to disable that behavior. (Remember how I
rewrote watch so it restarts in a timely fashion rather than blocking
arbitrarily long waiting for output to finish and _then_ timing out? I'd have
to _undo_ that if I humor this tty weirdness: 80x25=2000, so a full screen is
half this buffer...)

> strictly speaking, glibc has a helper program (stdbuf(1)) that lets
> you modify the buffering for the program it calls, but no other libc
> (that i know of) has anything similar, and i've never seen the glibc
> one used. as far as i know, the grep --line-buffered option, which is
> in both BSD and GNU, is the only place where a command-line utility
> explicitly offers fine control. and since Android's on-device grep has
> always been [Net]BSD and the build grep has been GNU grep, both
> device- and host-side expect this behavior, and control it where
> necessary with --line-buffered.

When you call write() it makes a system call, invokes the scheduler, copies
data between process contexts, often switches to another process that's been
waiting for CPU longer, and there's cache flushing all over the place
(especially with the spectre mitigations), so it can have a huge impact on
performance if your buffering granularity is wrong.

At LinuxWorld Expo 2001 Jon "Maddog" Hall told an anecdote in one of his panels
about an old mainframe batch job writing its output to reel-to-reel tape a byte
at a time, each time 1) spinning the tape up to speed skipping tape as it did,
2) writing a data packet with a header and footer around the single byte of
data, and 3) spinning the tape back down skipping more tape. This ate a dozen
tapes and took hours with someone standing by to switch tapes. He changed the
code so it wrote out 64 bytes of data at a time, and suddenly everything fit on
a single tape and was done in 15 minutes, and best of all it could run in the
middle of the night with nobody standing by and they could collect the tape in
the morning.

I have encountered variants of this issue dozens of times over the years, in
many different contexts. It's why TCP/IP has Nagle's algorithm, which can be
sung to the dreidel song. It's one of the two hard problems in computer
science (naming things, cache invalidation, and fencepost errors).

What I'm trying to avoid is extra write() system calls and the resulting
context switches and cache flushing, which seem likely to penalize performance
far more than cache-local fiddling with an 8k buffer.

The "line buffered" mode boils down to strrchr('\n') on the data we just wrote
and if we find one flush() to that point on the way out. It should never cause
_more_ write() calls than unbuffered mode (except in the case where you
explicitly fflush() afterwards to output a partial line).
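
In code that's roughly (a sketch of the idea, not the actual toybox
implementation; it punts on errors and on single writes bigger than the
buffer):

  #define _GNU_SOURCE  // for memrchr()
  #include <string.h>
  #include <unistd.h>

  static char obuf[8192];
  static size_t olen;

  // Append data to an 8k output buffer, then write() out everything up to
  // and including the last newline seen so far, keeping any trailing
  // partial line buffered.
  void out_line_buffered(char *data, size_t len)
  {
    char *nl;

    // Make room: drain the whole buffer if the new data won't fit.
    if (olen + len > sizeof(obuf)) {
      write(1, obuf, olen);
      olen = 0;
    }
    memcpy(obuf + olen, data, len);
    olen += len;

    // memrchr() is the length-aware strrchr(): find the LAST newline.
    if ((nl = memrchr(obuf, '\n', olen))) {
      size_t flushed = nl + 1 - obuf;

      write(1, obuf, flushed);
      memmove(obuf, obuf + flushed, olen - flushed);
      olen -= flushed;
    }
  }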

So you still get the vast majority of the performance boost from _not_ having
multiple writes per line when grep calls a bunch of little printf()s: an
average output line of 50 chars or so means 1/50th as many system calls, which
should be the bulk of your performance gain.

> there are definitely places where traditional tools do something
> stupid, and no one actually wants the stupid, and i'm happy with those
> deviations. but this is a place where i think toybox is behaving
> differently to no advantage and some disadvantage, whereas stdio left
> to its own devices (without constant hammering of the flush button)
> does a good enough job for both distinct use cases.

Sigh. My next guess would be a --set-block-buffered option for grep (which can
call setvbuf(_IOFBF) to put it back easily enough), but that's not what the
other one has and I doubt you want that.

The thing is, even if grep were special-cased to fully buffer into pipelines
by default (and thus piping the output to tee would be interactively broken),
full buffering is seriously _not_ a performance boost in a lot of cases. In my
linux source git checkout I just ran (using the ubuntu 14.04 versions of
everything):

  $ time bash -c "grep -ir max_range --line-buffered | head -n 2"
  kernel/dma/debug.c:	unsigned int max_range = dma_get_max_seg_size(ref->dev);
  kernel/dma/debug.c:	while (range <= max_range) {

  real	0m3.289s
  user	0m3.268s
  sys	0m0.022s

  $ time bash -c "grep -ir max_range | head -n 2"
  kernel/dma/debug.c:	unsigned int max_range = dma_get_max_seg_size(ref->dev);
  kernel/dma/debug.c:	while (range <= max_range) {

  real	9m5.821s
  user	4m8.223s
  sys	0m7.350s

Yes, that grepped through the .git directory (with history back to 0.0.1), but
it's still 3 seconds vs 9 minutes on a 2.4GHz Core i7. Because it chugs through
gigabytes more input confirming there aren't more hits before producing _any_
output. (And that was the _second_ run of the big one, to let the cache
populate so you don't blame my spinning rust. The short one was uncached.)

On an SMP system, the faster the other program can receive the data, the
faster it
can start processing it. The kernel guys redid the pipe infrastructure
extensively a few years back to mitigate the cost of ping-ponging between
processes, and going from 1 byte block transfers to ~60 byte block transfers
should be the bulk of the improvement in terms of context switching. (Heck,
the old ATM network protocol used tiny 53 byte cells, and I think the maddog
anecdote was 64 byte writes? I have the audio tapes of all the panels
somewhere but haven't listened to them in years, and I don't have a cassette
player handy.)

So waiting for 4k buffers can be a net _loss_ if you're getting less
parallelism. Have you benchmarked the performance penalty here, or are you just
going based on the gnu/grep man page's admonition?

If you care strongly, I can do an if (CFG_TOYBOX_ON_ANDROID) to disable it for
you, but then interactive use would suck and it's unlikely to be a measurable
performance win in your scripts. And I'd much rather do a --block-buffered
command line option to _enable_ it in cases where it turns out to be a thing...

(Also, I'm doing line buffered _input_ and line-at-a-time regex. If grep is a
serious performance bottleneck for you, other grep implementations do block
buffered input and work out line breaks only after they find matches, as
sketched below. That's probably more of a win than changing output buffering
now that we're not flushing multiple times per line. And yes, xprintf() plus
xputc('\n') was flushing at least twice per line before, my bad. Fixed now.)
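
For the fixed-string case that trick looks something like this (a hypothetical
sketch assuming a non-empty pattern; real grep also has to handle regexes and
matches spanning block boundaries):

  #define _GNU_SOURCE  // for memmem()
  #include <stdio.h>
  #include <string.h>

  // Search a block for a fixed string, locating line boundaries only
  // around each hit instead of splitting all input into lines up front.
  void scan_block(char *buf, size_t len, char *pat)
  {
    size_t plen = strlen(pat);
    char *hit, *start, *end, *p = buf;

    while ((hit = memmem(p, len - (p - buf), pat, plen))) {
      // Walk back to the previous newline, forward to the next one.
      for (start = hit; start > buf && start[-1] != '\n'; start--);
      if (!(end = memchr(hit + plen, '\n', buf + len - (hit + plen))))
        end = buf + len;

      fwrite(start, 1, end - start, stdout);
      putchar('\n');
      p = end; // skip the rest of this line so it prints only once
    }
  }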

Rob


