[Toybox] [PATCH] grep: add --line-buffered and fix regular buffering.

enh enh at google.com
Thu Apr 25 18:05:42 PDT 2019


On Thu, Apr 25, 2019 at 3:23 PM enh <enh at google.com> wrote:
>
> On Thu, Apr 25, 2019 at 3:03 PM Rob Landley <rob at landley.net> wrote:
> >
> > On 4/24/19 5:50 PM, enh wrote:
> > > On Wed, Apr 24, 2019 at 3:40 PM Rob Landley <rob at landley.net> wrote:
> > >>
> > >> On 4/23/19 8:06 PM, enh wrote:
> > >>> (sorry. busy.)
> > >>
> > >> Sigh. I'll add a config symbol you can switch on, with help text complaining
> > >> about it.
> > >
> > > well, it's not necessarily grep i'm worried about. i think if all the
> > > other toys are now going to be line-buffered, it doesn't make much
> > > sense to single out grep like this.
> >
> > Neither do I.
> >
> > That said, after writing up a point by point rebuttal to your message, I
> > benchmarked the code (silly to argue about something we haven't tested the speed
> > of) and found out that devuan's grep takes 8 seconds to do something mine takes
> > 20 to do _before_ the output buffering changed. Which is unacceptable and made
> > the argument we were having kinda moot.
>
> (one thing i was going to say is that line buffering has a few weird
> side-effects. C11's 7.21.3.3 implies that you have to flush
> line-buffered streams any time you read from an unbuffered or
> line-buffered stream, for example. which would affect your original
> code.)
>
> > So now I'm staring at do_lines() and seeing if I can get it to do large (at
> > least 64k) block reads where it doesn't memcpy() the data out into individual
> > lines but passes a pointer into the buffer where possible. (And this lets me
> > replace get_line(fd) which has been on the todo list forever, although moving
> > the all to do_lines() and a callback is probably the best way...)
>
> /me shudders having just fixed a bug in NetBSD's code to do this. but,
> yeah, it's probably worth the cost. and we have ASan to keep us
> honest.

that said... remember to profile first. (i won't say anything about my
original --line-buffered because i don't know whether your test case
is input-heavy, output-heavy, or both.)

NetBSD probably could/should be moved over to stdio's getline(3) --- i
think it just didn't exist at the time. (and if we could only get you
to trust stdio...)

having written a parallel grep in Java almost 20 years ago, mmap isn't
an obvious win. it's very dependent on what you're searching (number
and size of files), and to some extent what you're searching for. iirc
https://blog.burntsushi.net/ripgrep/ talks about this with much more
modern data, and they did actually go with a heuristic to guess when
they should/shouldn't use mmap. (glibc's stdio will use mmap in some
circumstances. i don't know of any other libc that does.)

what NetBSD grep (which is what we're coming from) _does_ spend a lot
of code on is avoiding regexec(3). they recognize easy cases (like
"you could have used -F") and handle them more cheaply. in an ideal
world, that would be in the regex implementation, but never having
tried i don't actually know how realistic that is. (if i had the time,
i'd love to compare against RE2.)

> > I might even be able to get it to mmap() and mremap() the data traversing down a
> > file without too intrusive a second path. (Does madvise(MADV_WILLNEED) persist
> > across mremaps of the same mapping, or do I need to call it again?)
> >
> > I _also_ have a todo item WAY down my list that if toysh has a "find | grep"
> > pipeline and both are toybox commands it can automatically -print0 -0 on it, but
> > I'm not sure the proper semantics for triggering that. (Or saying it _shouldn't_
> > do that, which is why it's way down on the list...)
> >
> > >> Does the attached work/suffice? (I _think_ setlinebuf(stdout) followed by
> > >> setvbuf() something else is legal as long as we haven't output anything to the
> > >> stream yet? But the man page is unclear. Presumably the admonition is because
> > >> pending unflushed data in the buffer would get discarded, but they don't _say_...)
> > >
> > > (POSIX explicitly says that it's only valid before any output, but
> > > bionic [and BSDs] flush output data and discard input data if you call
> > > setvbuf [and the others are implemented in terms of setvbuf].)
> >
> > musl doesn't appear bothered either. I read that as saying we can do it twice
> > before any output, but would be amazed if glibc ever tested that, yet care less
> > and less about breaking anything with "gnu" on it each year. It comes pre-broken
> > by the gnu all over it.
> >
> > Rob



More information about the Toybox mailing list