[Toybox] [PATCH] grep: add --line-buffered and fix regular buffering.

enh enh at google.com
Tue Apr 23 18:06:49 PDT 2019


(sorry. busy.)

On Tue, Apr 16, 2019 at 9:50 PM Rob Landley <rob at landley.net> wrote:
>
> On 4/16/19 7:42 PM, enh wrote:
> >> Line buffered is the behavior I've been assuming, but I see glibc defaults to
> >> "unbuffered"... in which case what is FILE * _for_ again? A strrchr() on each
> >> buffer with a flush to that point isn't that expensive, and most of the stuff
> >> that cares about micromanaging buffers is already using filehandles instead of
> >> FILE * anyway...
> >
> > i think the problem here is that you're looking at this through the
> > wrong lens. i think your mental model is "interactive user", whereas
> > i'm thinking of them *and* of scripts/builds.
>
> I'm thinking of both, but didn't see how scripts should be significantly
> penalized here?

because you've taken away the I/O buffering and replaced it with
*extra* work to look for \n?
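
(to be concrete about the extra work --- a sketch of the scheme as i
understand it; the helper and its contract are made up for illustration,
and error handling is elided:)

    #define _GNU_SOURCE  // for memrchr(), a glibc/bionic extension
    #include <string.h>
    #include <unistd.h>

    // Hypothetical sketch of "line buffering by hand": scan each chunk
    // for its last newline and flush up to (and including) it, keeping
    // any partial line behind for the next write.
    // Returns how many bytes were consumed from buf.
    static size_t flush_whole_lines(int fd, const char *buf, size_t len)
    {
      const char *nl = memrchr(buf, '\n', len);  // the extra per-byte scan

      if (!nl) return 0;               // no complete line yet: keep it all
      write(fd, buf, nl + 1 - buf);    // flush through the newline
      return nl + 1 - buf;             // caller re-buffers the tail
    }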

> > the traditional behavior of trusting the C library's buffering and
> > having just one exception (that i've ever seen) in grep _is_ sane.
> >
> > i don't know where your misapprehensions about libc's behavior are
> > coming from...
>
> When I first met Denys Vlasenko in person (the guy I handed busybox
> maintainership off to), it was because he attended a US conference talking
> about how he'd fixed a glibc bug that caused statically linked binaries with
> --gc-sections to not flush stdout on exit.
>
>   https://elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
>   https://bootlin.com/pub/video/2010/elc/elc2010-vlasenko-dead-code.ogv
>
> (I hit it long before then, but he root caused it. It was of course an issue
> where the glibc developers and the gcc developers were pointing at each other
> and each telling the other to change, and as a result the issue didn't get fixed
> for something like 5 years. Of course I think that was back when Ulrich "static
> linking considered harmful" Drepper was still maintainer,
> https://www.akkadia.org/drepper/no_static_linking.html. He was replaced by a
> committee that broke up in 2012 https://lwn.net/Articles/488847/ but it's still
> a gnu/dammit project and that's bad.)
>
> The "tee" command broke so it only passed through a 4k block of data at a time
> (rather than unbuffered output) _three_times_ over the years I used it. As in
> broke got fixed and broke again, 3 times, all on Linux. And that's not counting
> busybox breaking it and me fixing it there:
>
>   http://lists.busybox.net/pipermail/busybox/2004-May/045755.html
>
> And yes, the fix was to replace fread(), which was 4k buffering, with read().

i'm not claiming you never want line buffering. i'm claiming it's not
a good default.

> Here's a quick grep of my blog of _just_ fighting with the "tee" command over
> the years, not all of which was glibc of course:
>
> https://landley.net/notes-2005.html#23-08-2005
> https://landley.net/notes-2006.html#06-03-2007
> https://landley.net/notes-2008.html#07-10-2008
> https://landley.net/notes-2008.html#03-07-2008
> https://landley.net/notes-2011.html#24-08-2011
>
> The deepest dive to fix a tee issue slightly predates the current iteration of
> my blog:
>
> https://marc.info/?l=user-mode-linux-devel&m=111056135413631&w=2
>
> And the most _recent_ time glibc broke this was 2017:
>
>   https://landley.net/notes-2017.html#08-04-2017
>
> And yes, it was "stdout winds up fully buffered for some unknown reason, same
> build against musl did not, fix it with gratuitous flushes"...
>
> > they're not true of bionic, nor any version of glibc
> > i've ever seen.
>
> How about if I explicitly ask for _unbuffered_ in the common code (which
> according to you should be a NOP) and then implement --line-buffered for grep?

yeah, that should be the minimum for me to move forward, but i think
we'll be back here later to fix this properly.
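
(for the record, the option itself can be a one-liner at startup; a
sketch, not the actual patch --- the flag plumbing is hypothetical:)

    #include <stdio.h>

    // Hypothetical sketch: --line-buffered maps directly onto setvbuf(3).
    // Has to run before anything is written to the stream.
    static void setup_output(int line_buffered)
    {
      if (line_buffered) setvbuf(stdout, NULL, _IOLBF, 0);
      // otherwise leave stdio's default alone: block buffered into a
      // pipe or file, line buffered to a tty
    }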

> > POSIX explicitly says:
> >
> >   When opened, the standard error stream is not fully buffered; the
> > standard input and standard output streams are fully buffered if and
> > only if the stream can be determined not to refer to an interactive
> > device.
>
> So they don't specify whether it's line buffered or unbuffered, just not _fully_
> buffered. And if stdout is fully buffered when you pipe it to tee you get the 4k
> blocking problem again. So the bug kept showing up because the standard said the
> bug should show up, great.
>
> Except... the fix in the busybox case was on the _read_ side? It's fully
> buffering INPUT by default too? That's INSANE. If 4k of data is waiting when you
> read, you get/process 4k. If less is waiting, PROCESS WHAT YOU GOT...

well, that's an unrelated issue. the order of the two size_t arguments
to fread is really important --- it's going to do exactly what you
tell it, and they're two pretty different behaviors.
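
(for anyone following along, the difference is in what the return value
can tell you --- a minimal illustration, plain standard C:)

    #include <stdio.h>

    static void read_both_ways(void)
    {
      char buf[4096];

      // Byte-granular: returns the number of bytes actually read, so a
      // short read (partial final block) still tells you what you got.
      size_t n = fread(buf, 1, sizeof(buf), stdin);

      // Block-granular: returns the number of *complete* 4096-byte items,
      // i.e. 0 or 1 here. A partial final block returns 0, and how many
      // bytes it did copy into buf is unrecoverable.
      size_t m = fread(buf, sizeof(buf), 1, stdin);

      (void)n; (void)m;  // (just illustrating the two call shapes)
    }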

> > that's the only behavior i've ever seen implemented, and that's the
> > behavior you want. scripts/builds don't get pessimized performance,
> > and interactive users don't have to wait for 1KiB/4KiB/8KiB.
> >
> > (of course, this assumes that the application isn't flushing way too
> > often, which i claim is the real and only bug here.)
>
> Which I just removed. I agree, it was flushing way too often, because I was
> being more paranoid than necessary about error checking.
>
> But "stdout is unbuffered" and "we flush(stdout) after write" is the
> _same_thing_ isn't it?

no, because unbuffered is really expensive.

one of my first patches was fixing this bug in dmesg, if you remember :-)
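
(it's easy to see why --- run something like this under strace; the
program is a made-up demo, not the dmesg fix:)

    #include <stdio.h>

    int main(void)
    {
      setvbuf(stdout, NULL, _IONBF, 0);  // unbuffered, like stderr

      // With _IONBF every fputc() is its own write(2): 81 system calls
      // for one 80-column line. Delete the setvbuf() and the default
      // block buffering collapses this into a single write(2).
      for (int i = 0; i < 80; i++) fputc('x', stdout);
      fputc('\n', stdout);

      return 0;
    }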

> If it's fully buffered, then "grep | tee" doesn't show you lines as it finds
> them. (Not with all those extra flushes removed.) If it's unbuffered, we might
> as well have left xprintf() flushing each write.

that's WAI (working as intended), though. and exactly why
--line-buffered exists: for that one special case that you normally
don't need.

> Fully buffered pipeline output means "watch" will never progressively display
> anything unless it fakes a pty to disable that behavior. (Remember how I rewrote
> watch so it restarts in a timely fashion rather than blocking arbitrarily long
> waiting for output to finish and _then_ timing out? I'd have to _undo_ that if I
> humor this tty weirdness, 80x25=2000, so a full screen is half this buffer...)
>
> > strictly speaking, GNU coreutils has a helper program (stdbuf(1)) that
> > lets you modify the buffering for the program it calls, but no other
> > libc (that i know of) has anything similar, and i've never seen the
> > GNU one used. as far as i know, the grep --line-buffered option, which is
> > in both BSD and GNU, is the only place where a command-line utility
> > explicitly offers fine control. and since Android's on-device grep has
> > always been [Net]BSD and the build grep has been GNU grep, both
> > device- and host-side expect this behavior, and control it where
> > necessary with --line-buffered.
>
> When you call write() it makes a system call, invokes the scheduler, copies data
> between process contexts, often invokes another process that's been waiting for
> CPU longer, there's cache flushing all over the place (especially with the
> spectre mitigations), and it can have a huge impact on performance if your
> buffering granularity is wrong.

yes, which is why i'm asking you to not default to line buffering,
which effectively makes the buffer very small *and* adds extra
overhead to decide when to flush.

> At LinuxWorld Expo 2001 Jon "Maddog" Hall told an anecdote in one of his panels
> about an old mainframe batch job writing its output to reel-to-reel tape a byte
> at a time, each time 1) spinning the tape up to speed skipping tape as it did,
> 2) writing a data packet with a header and footer around the single byte of
> data, and 3) spinning the tape back down skipping more tape. This ate a dozen
> tapes and took hours with someone standing by to switch tapes. He changed the
> code so it wrote out 64 bytes of data at a time, and suddenly everything fit on
> a single tape and was done in 15 minutes, and best of all it could run in the
> middle of the night with nobody standing by and they could collect the tape in
> the morning.
>
> I have encountered variants of this issue dozens of times over the years, in
> many different contexts. It's why TCP/IP has Nagle's algorithm, which can be
> sung to the dreidel song. It's one of the two hard problems in computer science
> (naming things, cache invalidation, and fencepost errors).

(this is why i'm so confused by this thread and re-sent the patch
because i thought you'd come over to my side --- you keep seeming to
argue *my* side.)

> What I'm trying to avoid is extra write() system calls and the resulting context
> switches and cache flushing, which seem likely to penalize performance far more
> than cache-local fiddling with an 8k buffer.
>
> The "line buffered" mode boils down to strrchr('\n') on the data we just wrote
> and if we find one flush() to that point on the way out. It should never cause
> _more_ write() calls than unbuffered mode (except in the case where you
> explicitly fflush() afterwards to output a partial line).

i'm not asking for unbuffered. no one is asking for unbuffered. only
stderr is unbuffered. here's a copy & paste straight from bionic:

    MAKE_STD_STREAM(__SRD, STDIN_FILENO),
    MAKE_STD_STREAM(__SWR, STDOUT_FILENO),
    MAKE_STD_STREAM(__SWR|__SNBF, STDERR_FILENO),

(__SNBF is the "this is unbuffered" flag.)

stdout is buffered. that's what the fstat you see in strace is for: if
it succeeds, you get an st_blksize buffer. if not, BUFSIZ.
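
(i.e., morally equivalent to this --- a simplified sketch, not the
actual bionic code:)

    #include <stdio.h>
    #include <sys/stat.h>

    // Simplified sketch of how stdio sizes a stream's buffer: ask the
    // kernel for the fd's preferred I/O size, fall back to BUFSIZ.
    static size_t pick_buffer_size(int fd)
    {
      struct stat st;

      if (!fstat(fd, &st) && st.st_blksize > 0) return st.st_blksize;
      return BUFSIZ;
    }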

> So you still get the vast majority of the performance boost from _not_ having
> multiple writes per line when grep is calling a bunch of little printf(): an
> average output line of 50 chars or so still means 1/50th as many system calls,
> which should be the bulk of your performance gain.
>
> > there are definitely places where traditional tools do something
> > stupid, and no one actually wants the stupid, and i'm happy with those
> > deviations. but this is a place where i think toybox is behaving
> > differently to no advantage and some disadvantage, whereas stdio left
> > to its own devices (without constant hammering of the flush button)
> > does a good enough job for both distinct use cases.
>
> Sigh. My next guess would be a --set-block-buffered option for grep (which can
> call setvbuf(_IOFBF) to put it back easily enough), but that's not what the
> other grep has and I doubt you want that.
>
> The thing is even if grep were special-cased to fully buffer into pipelines by
> default (and thus piping the output to tee would be interactively broken), fully
> buffering is seriously _not_ a performance boost in a lot of cases. In my linux
> source git checkout I just did (using the ubuntu 14.04 versions of everything):
>
>   $ time bash -c "grep -ir max_range --line-buffered | head -n 2"
>   kernel/dma/debug.c:   unsigned int max_range = dma_get_max_seg_size(ref->dev);
>   kernel/dma/debug.c:   while (range <= max_range) {
>
>   real  0m3.289s
>   user  0m3.268s
>   sys   0m0.022s
>
>   $ time bash -c "grep -ir max_range | head -n 2"
>   kernel/dma/debug.c:   unsigned int max_range = dma_get_max_seg_size(ref->dev);
>   kernel/dma/debug.c:   while (range <= max_range) {
>
>   real  9m5.821s
>   user  4m8.223s
>   sys   0m7.350s
>
> Yes, that grepped through the .git directory (with history back to 0.0.1), but
> it's still 3 seconds vs 9 minutes on a 2.4GHz Core i7. Because it chugs through
> gigabytes more input confirming there aren't more hits before producing _any_
> output. (And that was the _second_ run of the big one, to let the cache populate
> so you don't blame my spinning rust. The short one was uncached.)
>
> On an SMP system, the faster the other program can receive the data, the faster it
> can start processing it. The kernel guys redid the pipe infrastructure
> extensively a few years back to mitigate the cost of ping-ponging between
> processes, and going from 1 byte block transfers to ~60 byte block transfers
> should be the bulk of the improvement in terms of context switching. (Heck the
> old ATM network protocol used 53 byte cells, and I think the maddog anecdote
> was 64 byte packets? I have the audio tape of all the panels somewhere but
> haven't listened to them in years; I don't have a cassette player handy.)
>
> So waiting for 4k buffers can be a net _loss_ if you're getting less
> parallelism. Have you benchmarked the performance penalty here, or are you just
> going based on the gnu/grep man page's admonition?

for interactive use, i agree that this is fine (and possibly better in
some places than the previous excessive flushing).

for non-interactive use, no, i don't currently have a specific case
where this is problematic. i'm just worried about being needlessly
wasteful, and worried about my ability to measure any "death by a
thousand cuts" lossage.

> If you care strongly, I can do an if (CFG_TOYBOX_ON_ANDROID) to disable it for
> you, but then interactive use would suck and it's unlikely to be a measurable
> performance win in your scripts. And I'd much rather do a --block-buffered
> command line option to _enable_ it in cases where it turns out to be a thing...
>
> (Also, I'm doing line buffered _input_ and line-at-a-time regex. If grep is a
> serious performance bottleneck for you, other grep implementations do block
> buffered grep input and work out line breaks after they find matches. That's
> probably more of a win than changing output buffering now that we're not
> flushing multiple times per line. And yes, xprintf() plus xputc('\n') was
> flushing at least twice per line before, my bad. Fixed now.)
>
> Rob


