<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 15, 2022 at 1:45 PM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 9/15/22 07:30, Yi-yo Chiang via Toybox wrote:<br>

> grep is slow when the number of patterns is large.<br>

...<br>

>   xxd -p -c 0 -l 40 /dev/urandom<br>

<br>

Huh, WHY does that produce two lines of output?<br>

<br>

$ xxd -p -c 0 -l 40 /dev/urandom<br>

1fcf13e1b4844ba209fb9958bde26a13577c577744f1b1290240d03f4f8e<br>

644fd0687c39b1aa8a68<br>

<br>

Behavior is consistent between toybox xxd and debian's, but Elliott sent in the<br>

xxd implementation and I don't use it much, so... Feeding in -c 40 and -c 80<br>

make no difference?<br></blockquote><div><br></div><div>i think that's actually a bug caused by this:</div><div><br></div><div>


<p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:14px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><span class="gmail-Apple-converted-space">  </span>// Plain style is 30 bytes/line, no grouping.</span></p>

<p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:14px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><span class="gmail-Apple-converted-space">  </span>if (FLAG(p)) TT.c = TT.g = 30;</span></p></div><div><br></div><div>should presumably be </div><div> </div><div>  // Plain style is 30 bytes/line, no grouping.<br>  if (FLAG(p)) {</div><div>    if (!TT.c) TT.c = 30;</div><div>    if (!TT.g) TT.g = 30;<br>  }</div><div><br></div><div>?</div><div><br></div><div>certainly "real" xxd works for me on macos and debian, both of which have the same version of xxd:</div><div><br></div><div>~$ xxd --version<br>xxd 2022-01-14 by Juergen Weigert et al.<br></div><div>~$ xxd -p -c 0 -l 40 /dev/urandom<br>ac160632955aa9d938e60d3533cbcf0febb4decdd12f130e415913ff1fe6e2abcaf7c4a8e980de7a</div><div><br></div><div>ah, but on another box with 2021-10-22 it's broken. so it looks like "real" xxd had the same bug and fixed it recently?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Huh. Oh well, tangent. I'll assume the test you sent me is the test you meant to<br>

send me.<br>

<br>

> A simple optimization would be to concat all patterns into one huge regex<br>

> pattern with '|'.<br>

<br>

Which you could do in your pattern creation just as easily as grep could do so<br>

internally? Except when I tried it:<br>

<br>

$ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \<br>

  test_large_10000<br>

<br>

The result was even _slower_ on glibc (and gave me "out of memory" on musl).<br></blockquote><div><br></div><div>fwiw, BSD grep which we used to use before had <a href="https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32">https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32</a> which (although i haven't tested) i always assumed was for this.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> I've hacked something together locally to do this, and the<br>

> improvement is evident (runtime reduced to ~2s).<br>

<br>

That was my first implementation, but it had portability issues (off the top of<br>

my head musl's non-extended regex syntax didn't support | and the \1 references<br>

don't reset at | either, plus I think there were circumstances in which ^ and $<br>

could get confused? Plus the empty regex line match thing...)<br>

<br>

So I replaced it in commit ca3528d7bf97. (Alas libc hasn't got a lex primitive<br>

the way it's got regexec.)<br>

<br>

> But this also had some potential problems:<br>

> 1. Does regexec always return the left-most longest match?<br>

<br>

Posix says it does:<br>

<br>

<a href="https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#:~:text=longest%20of%20the" rel="noreferrer" target="_blank">https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#:~:text=longest%20of%20the</a><br>

<br>

> 2. One large pattern might be problematic. `man 7 regex` threatens that<br>

> patterns larger than 256 bytes might not be supported by all regcomp<br>

> implementations.<br>

<br>

Yeah, musl fails.<br>

<br>

I don't actually know _what_ is slowing this down, but I'd guess "lack of cache<br>

locality" would be a significant part? Potentially a loop over the string with<br>

each pattern having "start of line" prepended to it might get some of the speed<br>

back? (Or again, it might be slower.)<br>

<br>

I could also sort the patterns by first character and loop through calling the<br>

appropriate regexes based on character at string position... which is harder<br>

than you think because there could be a | later in the pattern so that<br>

optimization only applies to a _subset_ of patterns. (First char being . or * or<br>

( kinda throws that off to, and that optimization would also have to be aware of<br>

case insensitivity...)<br>

<br>

There are things that can be done to speed this up, but we'd have to try and see<br>

what actually helped...<br>

<br>

> Thoughts on how risky this change could be? I could start sending you the<br>

> patches if you think the pros outweigh the potential cons?<br>

<br>

Gnu grep has its own bespoke regex implementation internally, because gnu. While<br>

I _have_ historically written my own regex implementation from scratch (using<br>

glob syntax rather than regex but same general idea), I chose not to do so here<br>

because project philosophy. This has downsides. There are things that can be<br>

done to speed it up, but I'd prefer a more real world test case to optimize<br>

against if we go down that path? (This is why I usually wait until somebody<br>

complains before optimizing or elaborating too much: they bring test cases. :)<br></blockquote><div><br></div><div>note that this is the second instance we've had of this problem. the previous googler to mention this was on a different team, working on something entirely different, but using grep in a similar way. they both boil down to "looking for one of a large number of files in a very large build log" though.</div><div><br></div><div>yochiang's example here only looks artificial because it deliberately is --- it's a way to reproduce without having to pay to do massive Android builds.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

(Also, gnu's bespoke regex is working on blocks of data that haven't been<br>

delineated into lines, again for speed reasons. That's less of an issue here,<br>

but I did read a paper on it when I was writing my own grep...)<br>

<br>

> Yi-yo<br>

<br>

Rob<br>

_______________________________________________<br>

Toybox mailing list<br>

<a href="mailto:Toybox@lists.landley.net" target="_blank">Toybox@lists.landley.net</a><br>

<a href="http://lists.landley.net/listinfo.cgi/toybox-landley.net" rel="noreferrer" target="_blank">http://lists.landley.net/listinfo.cgi/toybox-landley.net</a><br>

</blockquote></div></div>