<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 24, 2022 at 2:31 PM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 9/15/22 20:42, Rob Landley wrote:<br>

> On 9/15/22 16:32, enh wrote:<br>

>> ah, but on another box with 2021-10-22 it's broken. so it looks like "real" xxd<br>

>> had the same bug and fixed it recently?<br>

> <br>

> Easy enough to fix, I was just confused by "the one in debian fails the same way"...<br>

> <br>

>>     > A simple optimization would be to concat all patterns into one huge regex<br>

>>     > pattern with '|'.<br>

>> <br>

>>     Which you could do in your pattern creation just as easily as grep could do so<br>

>>     internally? Except when I tried it:<br>

>> <br>

>>     $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \<br>

>>       test_large_10000<br>

>> <br>

>>     The result was even _slower_ on glibc (and gave me "out of memory" on musl).<br>

>> <br>

>> fwiw, BSD grep which we used to use before<br>

>> had <a href="https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32" rel="noreferrer" target="_blank">https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32</a><br>

>> which (although i haven't tested) i always assumed was for this.<br>

> <br>

> Yeah, doing more ourselves is the obvious way to go...<br>

<br>

I punted on most of the utf8 stuff in there. Let me know if I should revisit<br>

that. (Right now I'm treating ANY string has char > 127 in it as "hand off to<br>

the regex engine" instead of using the fast path -F logic. In theory I only need<br>

to do that for -i but see recent email about reordering width zero unicode code<br>

points... if the regex engine does it, it's not my fault when it's wrong. :)<br></blockquote><div><br></div><div>both users i've seen have wanted this for "filenames in a build", so they should be fine. (and, yes, "i18n is someone else's problem" is usually the right choice down at this level.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

>>     I could also sort the patterns by first character and loop through calling the<br>

>>     appropriate regexes based on character at string position... which is harder<br>

>>     than you think because there could be a | later in the pattern so that<br>

>>     optimization only applies to a _subset_ of patterns. (First char being . or * or<br>

>>     ( kinda throws that off to, and that optimization would also have to be aware of<br>

>>     case insensitivity...)<br>

> <br>

> I was trying to avoid writing another regex implementation, but a fast subset<br>

> one that does parallel cache-local searching for simple patterns and hands off<br>

> to libc for anything complicated isn't that hard. And I can even cheat with a<br>

> bitmask for starting byte:<br>

<br>

What I did was simpler: filter the incoming TT.e list (after it's been split at<br>

\n and TT.f entries collated into it, which the code was already doing) into<br>

fixed and regex categories (ala autodetect when I can apply -F logic, with basic<br>

support for non-unicode case insensitivity, backslash escapes, and "." not in<br>

the first position), and then bucket sort the resulting -F list into an array of<br>

256 linked lists by starting character (punting for now that starting with "."<br>

means it autodetects as regex not fixed: the actual -F flag overrides this), and<br>

then use TT.fixed[c] being NULL to mean nothing can match here. Further<br>

optimization: sorting each per-character list by length in the setup phase means<br>

the actual search logic can stop at the first hit because it's the longest one.<br>

<br>

No bitmask needed: <a href="https://github.com/landley/toybox/commit/a7e49c3c7860" rel="noreferrer" target="_blank">https://github.com/landley/toybox/commit/a7e49c3c7860</a><br>

<br>

(Then it falls back to running the same old regex logic for any non-fixed<br>

patterns, where regexec searches the whole string from start to finish for each<br>

pattern. If there aren't any to loop naturally skips itself traversing zero<br>

entries, if there are a small number the performance hit shouldn't be too bad.)<br>

<br>

It's still slower than the gnu/dammit grep (0.9 seconds vs 0.3 seconds on my<br>

laptop), but hopefully good enough.<br></blockquote><div><br></div><div>heh, yeah, that sounds like plenty :-) i don't know of anyone who's actually timing this --- the complaint was just about the pathological behavior!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Try now?<br></blockquote><div><br></div><div>(update in progress. given that yochiang and i are in horribly incompatible timezones, we might be a while getting back to you...)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Rob<br>

<br>

P.S. I added a speed test to grep.test (5 second timeout, should still pass on<br>

the original raspberry pi model A from 2012 not that I have one), using base64<br>

on a long seq to get the many unique test lines instead of xxd /dev/urandom,<br>

because urandom is slow producing nontrivial amounts of data. Trying to feed seq<br>

into sha3sum or similar was equally slow: running the test should not be an<br>

order of magnitude faster than generating the test data.<br></blockquote><div><br></div><div>certainly fine on my rpi400...</div><div><br></div><div> PASS: grep speed</div><br></div><div class="gmail_quote">specifically it takes:</div><div class="gmail_quote"><br></div><div class="gmail_quote">real      0m0.609s<br>user  0m0.581s<br>sys   0m0.177s<br></div><div class="gmail_quote"><br></div><div class="gmail_quote">that seems like a good enough margin of error for slower rpis :-)</div><div class="gmail_quote"><br></div></div>