[Toybox] Slow grep

Yi-yo Chiang yochiang at google.com
Mon Sep 26 23:47:26 PDT 2022


(I suffered a system failure during the weekend and lost all my browser
tabs and consequently lost track of all my work...)

Might be a stupid question, but what's so special about ". in first
position"? Doesn't ANY '.' in ANY position (sans <backslash><dot>) indicate
a regex (not fixed pattern)?


On Tue, Sep 27, 2022 at 4:45 AM enh <enh at google.com> wrote:

>
>
> On Sat, Sep 24, 2022 at 2:31 PM Rob Landley <rob at landley.net> wrote:
>
>> On 9/15/22 20:42, Rob Landley wrote:
>> > On 9/15/22 16:32, enh wrote:
>> >> ah, but on another box with 2021-10-22 it's broken. so it looks like
>> "real" xxd
>> >> had the same bug and fixed it recently?
>> >
>> > Easy enough to fix, I was just confused by "the one in debian fails the
>> same way"...
>> >
>> >>     > A simple optimization would be to concat all patterns into one
>> huge regex
>> >>     > pattern with '|'.
>> >>
>> >>     Which you could do in your pattern creation just as easily as grep
>> could do so
>> >>     internally? Except when I tried it:
>> >>
>> >>     $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c
>> -1) \
>> >>       test_large_10000
>> >>
>> >>     The result was even _slower_ on glibc (and gave me "out of memory"
>> on musl).
>> >>
>> >> fwiw, BSD grep which we used to use before
>> >> had
>> https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
>> >> which (although i haven't tested) i always assumed was for this.
>> >
>> > Yeah, doing more ourselves is the obvious way to go...
>>
>> I punted on most of the utf8 stuff in there. Let me know if I should
>> revisit
>> that. (Right now I'm treating ANY string that has a char > 127 in it
>> as "hand off to
>> the regex engine" instead of using the fast path -F logic. In theory I
>> only need
>> to do that for -i but see recent email about reordering width zero
>> unicode code
>> points... if the regex engine does it, it's not my fault when it's wrong.
>> :)
>>
>
> both users i've seen have wanted this for "filenames in a build", so they
> should be fine. (and, yes, "i18n is someone else's problem" is usually the
> right choice down at this level.)
>
>
>> >>     I could also sort the patterns by first character and loop through
>> calling the
>> >>     appropriate regexes based on character at string position... which
>> is harder
>> >>     than you think because there could be a | later in the pattern so
>> that
>> >>     optimization only applies to a _subset_ of patterns. (First char
>> being . or * or
>> ( kinda throws that off too, and that optimization would also have
>> to be aware of
>> >>     case insensitivity...)
>> >
>> > I was trying to avoid writing another regex implementation, but a fast
>> subset
>> > one that does parallel cache-local searching for simple patterns and
>> hands off
>> > to libc for anything complicated isn't that hard. And I can even cheat
>> with a
>> > bitmask for starting byte:
>>
>> What I did was simpler: filter the incoming TT.e list (after it's been
>> split at
>> \n and TT.f entries collated into it, which the code was already doing)
>> into
>> fixed and regex categories (ala autodetect when I can apply -F logic,
>> with basic
>> support for non-unicode case insensitivity, backslash escapes, and "."
>> not in
>> the first position), and then bucket sort the resulting -F list into an
>> array of
>> 256 linked lists by starting character (punting for now that starting
>> with "."
>> means it autodetects as regex not fixed: the actual -F flag overrides
>> this), and
>> then use TT.fixed[c] being NULL to mean nothing can match here. Further
>> optimization: sorting each per-character list by length in the setup
>> phase means
>> the actual search logic can stop at the first hit because it's the
>> longest one.
>>
>> No bitmask needed: https://github.com/landley/toybox/commit/a7e49c3c7860
>>
>> (Then it falls back to running the same old regex logic for any non-fixed
>> patterns, where regexec searches the whole string from start to finish
>> for each
>> pattern. If there aren't any, the loop naturally skips itself, traversing
>> zero
>> entries, if there are a small number the performance hit shouldn't be too
>> bad.)
>>
>> It's still slower than the gnu/dammit grep (0.9 seconds vs 0.3 seconds on
>> my
>> laptop), but hopefully good enough.
>>
>
> heh, yeah, that sounds like plenty :-) i don't know of anyone who's
> actually timing this --- the complaint was just about the pathological
> behavior!
>
>
>> Try now?
>>
>
> (update in progress. given that yochiang and i are in horribly
> incompatible timezones, we might be a while getting back to you...)
>
>
>> Rob
>>
>> P.S. I added a speed test to grep.test (5 second timeout, should still
>> pass on
>> the original raspberry pi model A from 2012, not that I have one), using
>> base64
>> on a long seq to get the many unique test lines instead of xxd
>> /dev/urandom,
>> because urandom is slow producing nontrivial amounts of data. Trying to
>> feed seq
>> into sha3sum or similar was equally slow: running the test should not be
>> an
>> order of magnitude faster than generating the test data.
>>
>
> certainly fine on my rpi400...
>
>  PASS: grep speed
>
> specifically it takes:
>
> real 0m0.609s
> user 0m0.581s
> sys 0m0.177s
>
> that seems like a good enough margin of error for slower rpis :-)
>
>

