<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 24, 2022 at 2:31 PM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 9/15/22 20:42, Rob Landley wrote:<br>
> On 9/15/22 16:32, enh wrote:<br>
>> ah, but on another box with 2021-10-22 it's broken. so it looks like "real" xxd<br>
>> had the same bug and fixed it recently?<br>
> <br>
> Easy enough to fix, I was just confused by "the one in debian fails the same way"...<br>
> <br>
>> > A simple optimization would be to concat all patterns into one huge regex<br>
>> > pattern with '|'.<br>
>> <br>
>> Which you could do in your pattern creation just as easily as grep could do so<br>
>> internally? Except when I tried it:<br>
>> <br>
>> $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \<br>
>> test_large_10000<br>
>> <br>
>> The result was even _slower_ on glibc (and gave me "out of memory" on musl).<br>
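<br>
(For reference, the merge itself is easy; a hypothetical C sketch of building and<br>
compiling the single giant alternation, with made-up names, not toybox code:<br>
<br>
#include <regex.h><br>
#include <stdlib.h><br>
#include <string.h><br>
<br>
// join n patterns into one big "a|b|c" ERE and compile it once<br>
int compile_merged(regex_t *re, char **pats, int n)<br>
{<br>
  size_t len = 1;<br>
  int i;<br>
  char *big, *s;<br>
<br>
  for (i = 0; i < n; i++) len += strlen(pats[i]) + 1;<br>
  if (!(s = big = malloc(len))) return -1;<br>
  for (i = 0; i < n; i++) {<br>
    if (i) *s++ = '|';<br>
    s = stpcpy(s, pats[i]);<br>
  }<br>
  i = regcomp(re, big, REG_EXTENDED|REG_NOSUB);<br>
  free(big);<br>
  return i;<br>
}<br>
<br>
Presumably the libc engine doesn't turn a huge alternation into anything smarter<br>
than trying each branch in turn, hence the slowdown and the musl allocation<br>
failure above.)<br>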
>> <br>
>> fwiw, BSD grep which we used to use before<br>
>> had <a href="https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32" rel="noreferrer" target="_blank">https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32</a><br>
>> which (although i haven't tested) i always assumed was for this.<br>
> <br>
> Yeah, doing more ourselves is the obvious way to go...<br>
<br>
I punted on most of the utf8 stuff in there. Let me know if I should revisit<br>
that. (Right now I'm treating ANY string that has a char > 127 in it as "hand off to<br>
the regex engine" instead of using the fast path -F logic. In theory I only need<br>
to do that for -i, but see the recent email about reordering zero-width unicode code<br>
points... if the regex engine does it, it's not my fault when it's wrong. :)<br></blockquote><div><br></div><div>both users i've seen have wanted this for "filenames in a build", so they should be fine. (and, yes, "i18n is someone else's problem" is usually the right choice down at this level.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
>> I could also sort the patterns by first character and loop through calling the<br>
>> appropriate regexes based on character at string position... which is harder<br>
>> than you think because there could be a | later in the pattern so that<br>
>> optimization only applies to a _subset_ of patterns. (First char being . or * or<br>
>> ( kinda throws that off too, and that optimization would also have to be aware of<br>
>> case insensitivity...)<br>
> <br>
> I was trying to avoid writing another regex implementation, but a fast subset<br>
> one that does parallel cache-local searching for simple patterns and hands off<br>
> to libc for anything complicated isn't that hard. And I can even cheat with a<br>
> bitmask for starting byte:<br>
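<br>
Something like this hypothetical four-word bitmap: one bit per possible first<br>
byte, tested before doing any per-pattern work at a position. Illustration only,<br>
not what got committed:<br>
<br>
// set and test which bytes can start a match<br>
unsigned start[256/32];<br>
<br>
void mark_start(unsigned char c) { start[c>>5] |= 1u<<(c&31); }<br>
int may_start(unsigned char c) { return start[c>>5] & (1u<<(c&31)); }<br>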
<br>
What I did was simpler: filter the incoming TT.e list (after it's been split at<br>
\n and the TT.f entries collated into it, which the code was already doing) into<br>
fixed and regex categories, i.e. autodetect when I can apply -F logic, with basic<br>
support for non-unicode case insensitivity, backslash escapes, and "." not in the<br>
first position. Then bucket sort the resulting -F list into an array of 256<br>
linked lists by starting character (punting for now on patterns that start with<br>
".", which autodetect as regex rather than fixed: an explicit -F flag overrides<br>
this), and use TT.fixed[c] being NULL to mean nothing can match here. Further<br>
optimization: sorting each per-character list by length in the setup phase means<br>
the actual search logic can stop at the first hit, because it's the longest one.<br>
<br>
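A minimal sketch of that shape (made-up names, with case folding and escapes<br>
omitted; the real code is in the commit below):<br>
<br>
#include <string.h><br>
<br>
// 256 pattern lists keyed by first byte, each sorted longest-first at setup<br>
// so the first hit at any position is also the longest one<br>
struct fixed { struct fixed *next; int len; char *str; };<br>
struct fixed *bucket[256];   // NULL entry: nothing starts with this byte<br>
<br>
int match_fixed(char *line)<br>
{<br>
  char *s;<br>
  struct fixed *f;<br>
<br>
  for (s = line; *s; s++)<br>
    for (f = bucket[*(unsigned char *)s]; f; f = f->next)<br>
      if (!strncmp(s, f->str, f->len)) return 1;<br>
  return 0;<br>
}<br>
<br>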
No bitmask needed: <a href="https://github.com/landley/toybox/commit/a7e49c3c7860" rel="noreferrer" target="_blank">https://github.com/landley/toybox/commit/a7e49c3c7860</a><br>
<br>
(Then it falls back to running the same old regex logic for any non-fixed<br>
patterns, where regexec searches the whole string from start to finish for each<br>
pattern. If there aren't any, the loop naturally skips itself by traversing zero<br>
entries; if there are a small number, the performance hit shouldn't be too bad.)<br>
<br>
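I.e. something like this hypothetical loop, where zero leftover patterns means<br>
zero iterations:<br>
<br>
#include <regex.h><br>
<br>
// fallback: anything that didn't qualify as fixed goes through plain regexec(),<br>
// one compiled pattern per entry<br>
int match_regex(regex_t *regs, int nregex, char *line)<br>
{<br>
  int i;<br>
<br>
  for (i = 0; i < nregex; i++)<br>
    if (!regexec(regs+i, line, 0, 0, 0)) return 1;<br>
  return 0;<br>
}<br>
<br>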
It's still slower than the gnu/dammit grep (0.9 seconds vs 0.3 seconds on my<br>
laptop), but hopefully good enough.<br></blockquote><div><br></div><div>heh, yeah, that sounds like plenty :-) i don't know of anyone who's actually timing this --- the complaint was just about the pathological behavior!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Try now?<br></blockquote><div><br></div><div>(update in progress. given that yochiang and i are in horribly incompatible timezones, we might be a while getting back to you...)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Rob<br>
<br>
P.S. I added a speed test to grep.test (5 second timeout, which should still pass<br>
on the original raspberry pi model A from 2012, not that I have one), using base64<br>
on a long seq to get many unique test lines instead of xxd /dev/urandom, because<br>
urandom is slow at producing nontrivial amounts of data. Trying to feed seq<br>
into sha3sum or similar was equally slow: running the test should not be an<br>
order of magnitude faster than generating the test data.<br></blockquote><div><br></div><div>certainly fine on my rpi400...</div><div><br></div><div> PASS: grep speed</div><br></div><div class="gmail_quote">specifically it takes:</div><div class="gmail_quote"><br></div><div class="gmail_quote">real 0m0.609s<br>user 0m0.581s<br>sys 0m0.177s<br></div><div class="gmail_quote"><br></div><div class="gmail_quote">that seems like a good enough margin of error for slower rpis :-)</div><div class="gmail_quote"><br></div></div>