[Toybox] Slow grep

Rob Landley rob at landley.net
Sat Sep 24 14:39:43 PDT 2022


On 9/15/22 20:42, Rob Landley wrote:
> On 9/15/22 16:32, enh wrote:
>> ah, but on another box with 2021-10-22 it's broken. so it looks like "real" xxd
>> had the same bug and fixed it recently?
> 
> Easy enough to fix, I was just confused by "the one in debian fails the same way"...
> 
>>     > A simple optimization would be to concat all patterns into one huge regex
>>     > pattern with '|'.
>> 
>>     Which you could do in your pattern creation just as easily as grep could do so
>>     internally? Except when I tried it:
>> 
>>     $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
>>       test_large_10000
>> 
>>     The result was even _slower_ on glibc (and gave me "out of memory" on musl).
>> 
>> fwiw, BSD grep which we used to use before
>> had https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
>> which (although i haven't tested) i always assumed was for this.
> 
> Yeah, doing more ourselves is the obvious way to go...

I punted on most of the utf8 stuff in there. Let me know if I should revisit
that. (Right now I'm treating ANY string with a char > 127 in it as "hand off to
the regex engine" instead of using the fast path -F logic. In theory I only need
to do that for -i, but see the recent email about reordering zero-width unicode
code points... if the regex engine does it, it's not my fault when it's wrong. :)

>>     I could also sort the patterns by first character and loop through calling the
>>     appropriate regexes based on character at string position... which is harder
>>     than you think because there could be a | later in the pattern so that
>>     optimization only applies to a _subset_ of patterns. (First char being . or * or
>>     ( kinda throws that off too, and that optimization would also have to be aware of
>>     case insensitivity...)
> 
> I was trying to avoid writing another regex implementation, but a fast subset
> one that does parallel cache-local searching for simple patterns and hands off
> to libc for anything complicated isn't that hard. And I can even cheat with a
> bitmask for starting byte:

What I did was simpler: filter the incoming TT.e list (after it's been split at
\n and the TT.f entries collated into it, which the code was already doing) into
fixed and regex categories, i.e. autodetect when I can apply -F logic, with basic
support for non-unicode case insensitivity, backslash escapes, and "." not in
the first position. Then bucket sort the resulting -F list into an array of
256 linked lists by starting character (punting for now: starting with "."
means it autodetects as regex rather than fixed, although the actual -F flag
overrides this), and use TT.fixed[c] being NULL to mean nothing can match here.
Further optimization: sorting each per-character list by length in the setup
phase means the actual search logic can stop at the first hit, because that's
the longest one.

No bitmask needed: https://github.com/landley/toybox/commit/a7e49c3c7860

(Then it falls back to running the same old regex logic for any non-fixed
patterns, where regexec searches the whole string from start to finish for each
pattern. If there aren't any, the loop naturally skips itself by traversing zero
entries; if there are only a small number, the performance hit shouldn't be too
bad.)
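In sketch form (the helper name is hypothetical), the fallback is just a loop over the leftover compiled regexes, which costs nothing when that count is zero:

```c
#include <regex.h>

// Hypothetical fallback: try each leftover compiled regex against the
// line with regexec(). With npat == 0 the loop body never runs, so the
// fixed-pattern fast path pays nothing for having this here.
static int any_regex_match(regex_t *re, int npat, const char *line)
{
  int i;

  for (i = 0; i < npat; i++)
    if (!regexec(re+i, line, 0, 0, 0)) return 1;

  return 0;
}
```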

It's still slower than the gnu/dammit grep (0.9 seconds vs 0.3 seconds on my
laptop), but hopefully good enough.

Try now?

Rob

P.S. I added a speed test to grep.test (5 second timeout, which should still
pass on the original Raspberry Pi model A from 2012, not that I have one), using
base64 on a long seq to generate the many unique test lines instead of xxd
/dev/urandom, because urandom is slow at producing nontrivial amounts of data.
Trying to feed seq into sha3sum or similar was equally slow: running the test
shouldn't be an order of magnitude faster than generating the test data.