[Toybox] Slow grep

Thu Sep 15 18:42:16 PDT 2022

On 9/15/22 16:32, enh wrote:
> ah, but on another box with 2021-10-22 it's broken. so it looks like "real" xxd
> had the same bug and fixed it recently?

Easy enough to fix, I was just confused by "the one in debian fails the same way"...

>     Huh. Oh well, tangent. I'll assume the test you sent me is the test you meant to
>     send me.
> 
>     > A simple optimization would be to concat all patterns into one huge regex
>     > pattern with '|'.
> 
>     Which you could do in your pattern creation just as easily as grep could do so
>     internally? Except when I tried it:
> 
>     $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
>       test_large_10000
> 
>     The result was even _slower_ on glibc (and gave me "out of memory" on musl).
> 
> fwiw, BSD grep which we used to use before
> had https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
> which (although i haven't tested) i always assumed was for this.

Yeah, doing more ourselves is the obvious way to go...

>     I could also sort the patterns by first character and loop through calling the
>     appropriate regexes based on character at string position... which is harder
>     than you think because there could be a | later in the pattern so that
>     optimization only applies to a _subset_ of patterns. (First char being . or * or
>     ( kinda throws that off to, and that optimization would also have to be aware of
>     case insensitivity...)

I was trying to avoid writing another regex implementation, but a fast subset
one that does parallel cache-local searching for simple patterns and hands off
to libc for anything complicated isn't that hard. And I can even cheat with a
bitmask for starting byte:

  unsigned mask[8], len;
  char *str;

  for (len = 0; len<str; len++) {
    if (!mask[char>>5][char&31]) continue;
    // actual comparison
  }

>     Gnu grep has its own bespoke regex implementation internally, because gnu. While
>     I _have_ historically written my own regex implementation from scratch (using
>     glob syntax rather than regex but same general idea), I chose not to do so here
>     because project philosophy. This has downsides. There are things that can be
>     done to speed it up, but I'd prefer a more real world test case to optimize
>     against if we go down that path? (This is why I usually wait until somebody
>     complains before optimizing or elaborating too much: they bring test cases. :)
> 
> note that this is the second instance we've had of this problem. the previous
> googler to mention this was on a different team, working on something entirely
> different, but using grep in a similar way. they both boil down to "looking for
> one of a large number of files in a very large build log" though.
> 
> yochiang's example here only looks artificial because it deliberately is ---
> it's a way to reproduce without having to pay to do massive Android builds.

Oh sure. Important use case, I'm on it. I just want to make sure I fix it
properly. (I asked because I was mentally thumbing through hashing approaches,
but 256 bytes is small enough to just have an array.)

Plumbing-wise this is sort of an autodetected per-pattern -F logic, except how
does that interact with -i...

/me wanders off to pace and stare at the ceiling a bit...

Rob