[Toybox] Slow grep
Rob Landley
rob at landley.net
Thu Sep 15 18:42:16 PDT 2022
On 9/15/22 16:32, enh wrote:
> ah, but on another box with 2021-10-22 it's broken. so it looks like "real" xxd
> had the same bug and fixed it recently?
Easy enough to fix, I was just confused by "the one in debian fails the same way"...
> Huh. Oh well, tangent. I'll assume the test you sent me is the test you meant to
> send me.
>
> > A simple optimization would be to concat all patterns into one huge regex
> > pattern with '|'.
>
> Which you could do in your pattern creation just as easily as grep could do so
> internally? Except when I tried it:
>
> $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
> test_large_10000
>
> The result was even _slower_ on glibc (and gave me "out of memory" on musl).
>
> fwiw, BSD grep which we used to use before
> had https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
> which (although i haven't tested) i always assumed was for this.
Yeah, doing more ourselves is the obvious way to go...
> I could also sort the patterns by first character and loop through calling the
> appropriate regexes based on character at string position... which is harder
> than you think because there could be a | later in the pattern so that
> optimization only applies to a _subset_ of patterns. (First char being . or * or
> ( kinda throws that off to, and that optimization would also have to be aware of
> case insensitivity...)
I was trying to avoid writing another regex implementation, but a fast subset
one that does parallel cache-local searching for simple patterns and hands off
to libc for anything complicated isn't that hard. And I can even cheat with a
bitmask for starting byte:
unsigned mask[8], len;
char *str;
for (len = 0; len<str; len++) {
if (!mask[char>>5][char&31]) continue;
// actual comparison
}
> Gnu grep has its own bespoke regex implementation internally, because gnu. While
> I _have_ historically written my own regex implementation from scratch (using
> glob syntax rather than regex but same general idea), I chose not to do so here
> because project philosophy. This has downsides. There are things that can be
> done to speed it up, but I'd prefer a more real world test case to optimize
> against if we go down that path? (This is why I usually wait until somebody
> complains before optimizing or elaborating too much: they bring test cases. :)
>
> note that this is the second instance we've had of this problem. the previous
> googler to mention this was on a different team, working on something entirely
> different, but using grep in a similar way. they both boil down to "looking for
> one of a large number of files in a very large build log" though.
>
> yochiang's example here only looks artificial because it deliberately is ---
> it's a way to reproduce without having to pay to do massive Android builds.
Oh sure. Important use case, I'm on it. I just want to make sure I fix it
properly. (I asked because I was mentally thumbing through hashing approaches,
but 256 bytes is small enough to just have an array.)
Plumbing-wise this is sort of an autodetected per-pattern -F logic, except how
does that interact with -i...
/me wanders off to pace and stare at the ceiling a bit...
Rob
More information about the Toybox
mailing list