[Toybox] [PATCH] grep: add --line-buffered and fix regular buffering.

Rob Landley rob at landley.net
Fri Apr 26 12:54:19 PDT 2019


On 4/25/19 5:23 PM, enh wrote:
> On Thu, Apr 25, 2019 at 3:03 PM Rob Landley <rob at landley.net> wrote:
>>
>> On 4/24/19 5:50 PM, enh wrote:
>>> On Wed, Apr 24, 2019 at 3:40 PM Rob Landley <rob at landley.net> wrote:
>>>>
>>>> On 4/23/19 8:06 PM, enh wrote:
>>>>> (sorry. busy.)
>>>>
>>>> Sigh. I'll add a config symbol you can switch on, with help text complaining
>>>> about it.
>>>
>>> well, it's not necessarily grep i'm worried about. i think if all the
>>> other toys are now going to be line-buffered, it doesn't make much
>>> sense to single out grep like this.
>>
>> Neither do I.

P.S. Keep in mind I'm still trying to make "android workstations" a thing. There
really should be end users for this stuff.

>> That said, after writing up a point by point rebuttal to your message, I
>> benchmarked the code (silly to argue about something we haven't tested the speed
>> of) and found out that devuan's grep takes 8 seconds to do something mine takes
>> 20 to do _before_ the output buffering changed. Which is unacceptable and made
>> the argument we were having kinda moot.
> 
> (one thing i was going to say is that line buffering has a few weird
> side-effects. C11's 7.21.3.3 implies that you have to flush
> line-buffered streams any time you read from an unbuffered or
> line-buffered stream, for example.

...why?

I believe you that it says it. What possible purpose could it serve? (Sending
data to yourself through a pipe or fifo wrapped with a FILE *? That's pilot
error... If I specify -c99 to the compiler will it _not_? No, because this is
libc breakage...)

> which would affect your original code.)

That said, the flush gets us back to unbuffered so it's not going to be _worse_
than defaulting to unbuffered.

To be honest, the do_lines() rewrite would redo the input based on filehandles,
and redoing the _output_ based on filehandles is starting to sound less bad all
the time with each new FILE * stupidity. (I initially avoided it because it
seemed unnecessary, then started again because getline() went into posix-2008.)

I really don't _want_ to micromanage this nonsense, neither with manual flushes
sprinkled into who knows how many apps, nor with selecting per-app buffering
mode without a sane default. I want it to just work and stay out of my way, and
if it can't sanely buffer then the Linux pipe infrastructure _can_.

This "tty and non-tty work different" is too clever and breaks less in a really
visible way.

>> So now I'm staring at do_lines() and seeing if I can get it to do large (at
>> least 64k) block reads where it doesn't memcpy() the data out into individual
>> lines but passes a pointer into the buffer where possible. (And this lets me
>> replace get_line(fd) which has been on the todo list forever, although moving
>> the all to do_lines() and a callback is probably the best way...)
> 
> /me shudders having just fixed a bug in NetBSD's code to do this. but,
> yeah, it's probably worth the cost. and we have ASan to keep us
> honest.

Wasn't my first choice either, but a potential >2x performance improvement's
kind of a thing. And the proper thing to do is probably mmap() the whole file
when I can, read at least megabyte chunks when I can't (using the poll() and
read as much as won't block trick, anyway), and maybe fall back to the existing
getline() loop on nommu.

I saw a mailing list post ages ago about other grep implementations using a
streaming regex search where it ignores newlines until it finds a match and then
analyzes the match for line breaks, and I really don't want to change the regex
plumbing. (And with multiple patterns iterating over a 20 megabyte of file
thrashing the cache like that is likely to be way slower than processing a cache
local working set...)

Sigh. I'd hoped to get through the roadmap to 1.0 before worrying too much about
performance tuning, but the project appears to have developed a userbase. :)

Rob



More information about the Toybox mailing list