[Toybox] [PATCH] Start replacing get_line() with getline().

Tue Jul 23 13:52:25 PDT 2019

On 7/22/19 6:28 PM, enh wrote:
>> Longer-term the input batching is a reasonable performance win, something you
>> care about in grep. And that rewrites xgetline() and xgetdelim() wrappers around
>> a new function anyway, and as long as we're doing that we may as well change the
>> API to a) always return a malloced string, b) optionally return the length of
>> the string...
> 
> i'm not sure what you mean here. the underlying getdelim()/getline()
> does this (via the buffer in the FILE*).

Said buffer varies from 256 bytes to 8k in the wild. An input size more like 64k
(or larger is what other implementations do to improve performance. (Although
what _they_ do is run the regex over the raw input buffer and then figure out
line boundaries after they've found matches...)

I still have a todo item to speed up grep, and both the input and the output
buffer sides could use work. (The output buffer size has UI impact I need to
spend more time on. Speeding up the input buffer side does not, that I am aware of.)

> it's the get_line() that
> we're removing that _doesn't_ buffer input. switching to getline() [or
> loopfiles_lines, which uses it] gets you all the way to where you want
> to be from that perspective.

Sure. That's low hanging fruit. I'm just pondering other changes that might be
worth making here. (If I'm going to do it for grep, can I genericize it?) But I
haven't had time to do them this year and am not saying to hold up other changes
for them. Just "if we're in the area, I've been thinking about..."

The thinking went "if we're doing a getline() wrapper that doesn't need size_t,
what should it look like". I sat down to try it out yesterday, and decided it
was a bigger can of worms than I wanted to open just now.

>> Ah, wait. There ARE times we care about the delimiter being there or not: the
>> last line isn't guaranteed to have it and there are subtle behavior changes in
>> sed and such (which I implemented!) based on this. (It's related to the
>> sed-of-a-binary thing, you don't want to add a NUL to the end of the file if
>> there wasn't one...)
> 
> on the [i think valid] assumption that we only need '\n' or '\0' as
> delimiters, a flag for that and a flag for whether or not to keep the> delimiter in loopfiles_lines seems like it would be more than enough
> to be getting on with? better than where we are today, anyway.

Pushing more stuff to loopfiles_lines() means _it_ could do fancy blocking
behind the scenes. That's one of my notes in this todo item already. :)

> apart from the fact a few locals need to be promoted to [TT.] globals,
> loopfile_lines() seems strictly like an improvement for code currently
> using getline() too?

It was intended to be.

> the attached patch switches over nl.c, for
> example, and it's less code and a larger scope for the reuse of the
> getline() buffer (even ignoring the pessimal use of getline() in nl.c
> that allocated+deallocated a new buffer per line).
> 
> [PATCH] nl: switch from getline() to loopfiles_lines().

Hmmm, as long as we've got the length anyway possibly regexec0() would be
slightly faster, since it uses REG_STARTEND? (Otherwise you're putting it in a
global to avoid the repeated strlen, and then regexec just strlens anyway.

(I haven't really done a lot of performance tweaks yet. Still filling out the
functionality roadmap. The extra overhead to set up the structure to pass in the
data might be a net loss, I really want to profile stuff and have real world
test loads I'd care about optimizing...)

But yeah, the byte at a time get_line() read is pessimal and should go byebye.
It was a "quick get this working fix it later" stub written back before
posix-2008 came out.

Rob