[Toybox] Slow grep

enh enh at google.com
Thu Sep 15 14:32:01 PDT 2022


On Thu, Sep 15, 2022 at 1:45 PM Rob Landley <rob at landley.net> wrote:

> On 9/15/22 07:30, Yi-yo Chiang via Toybox wrote:
> > grep is slow when the number of patterns is large.
> ...
> >   xxd -p -c 0 -l 40 /dev/urandom
>
> Huh, WHY does that produce two lines of output?
>
> $ xxd -p -c 0 -l 40 /dev/urandom
> 1fcf13e1b4844ba209fb9958bde26a13577c577744f1b1290240d03f4f8e
> 644fd0687c39b1aa8a68
>
> Behavior is consistent between toybox xxd and debian's, but Elliott sent
> in the xxd implementation and I don't use it much, so... Feeding in -c 40
> and -c 80 make no difference?
>

i think that's actually a bug caused by this:

  // Plain style is 30 bytes/line, no grouping.
  if (FLAG(p)) TT.c = TT.g = 30;

which unconditionally clobbers whatever -c/-g the user passed (hence -c 40
and -c 80 making no difference, and -c 0 still wrapping at 30 bytes). it
should presumably be

  // Plain style is 30 bytes/line, no grouping.
  if (FLAG(p)) {
    if (!TT.c) TT.c = 30;
    if (!TT.g) TT.g = 30;
  }

?

certainly "real" xxd works for me on macos and debian, both of which have
the same version of xxd:

~$ xxd --version
xxd 2022-01-14 by Juergen Weigert et al.
~$ xxd -p -c 0 -l 40 /dev/urandom
ac160632955aa9d938e60d3533cbcf0febb4decdd12f130e415913ff1fe6e2abcaf7c4a8e980de7a

ah, but on another box with 2021-10-22 it's broken. so it looks like "real"
xxd had the same bug and fixed it recently?

> Huh. Oh well, tangent. I'll assume the test you sent me is the test you
> meant to send me.
>
> > A simple optimization would be to concat all patterns into one huge regex
> > pattern with '|'.
>
> Which you could do in your pattern creation just as easily as grep could
> do so internally? Except when I tried it:
>
> $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
>   test_large_10000
>
> The result was even _slower_ on glibc (and gave me "out of memory" on
> musl).
>

fwiw, the BSD grep we used to use before this one had
https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
which (although i haven't tested it) i always assumed was there for exactly this.
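
in case it's useful to see it concretely, here's a rough sketch of the
"join everything into one big ERE" approach (untested, not toybox code,
and it assumes the patterns are fixed strings with nothing that needs
escaping):

  #include <regex.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  // Build "p1|p2|...|pn" from n fixed-string patterns and compile it once,
  // so each input line costs a single regexec() instead of n of them.
  void combine_patterns(regex_t *preg, char **pats, int n)
  {
    size_t len = 1;
    char *big, *s;
    int i;

    for (i = 0; i < n; i++) len += strlen(pats[i])+1;
    s = big = malloc(len);
    for (i = 0; i < n; i++) s += sprintf(s, "%s%s", i ? "|" : "", pats[i]);

    // This is the step that falls over when the combined pattern gets huge:
    // man 7 regex only promises REs up to 256 bytes, and musl apparently
    // says no in practice.
    if (regcomp(preg, big, REG_EXTENDED|REG_NOSUB)) {
      fprintf(stderr, "regcomp failed\n");
      exit(1);
    }
    free(big);
  }

matching a line is then one regexec(preg, line, 0, 0, 0) call instead of a
loop over 10000 compiled patterns, which is presumably where the ~2s number
comes from.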


> > I've hacked something together locally to do this, and the
> > improvement is evident (runtime reduced to ~2s).
>
> That was my first implementation, but it had portability issues (off the
> top of my head musl's non-extended regex syntax didn't support | and the
> \1 references don't reset at | either, plus I think there were
> circumstances in which ^ and $ could get confused? Plus the empty regex
> line match thing...)
>
> So I replaced it in commit ca3528d7bf97. (Alas libc hasn't got a lex
> primitive the way it's got regexec.)
>
> > But this also had some potential problems:
> > 1. Does regexec always return the left-most longest match?
>
> Posix says it does:
>
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#:~:text=longest%20of%20the
>
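
(side note: if the leftmost-longest rule isn't familiar, here's a tiny
standalone check. per the posix text linked above, a conforming regexec()
should report the 6-character match at offset 1 rather than stopping at
the shorter "foo".)

  #include <regex.h>
  #include <stdio.h>

  int main(void)
  {
    regex_t re;
    regmatch_t m;

    // "foo" comes first in the alternation, but POSIX requires the longest
    // match starting at the leftmost position, i.e. "foobar".
    regcomp(&re, "foo|foobar", REG_EXTENDED);
    regexec(&re, "xfoobar", 1, &m, 0);
    printf("offset %d, length %d\n", (int)m.rm_so, (int)(m.rm_eo-m.rm_so));
    regfree(&re);

    return 0;
  }
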
> > 2. One large pattern might be problematic. `man 7 regex` threatens that
> > patterns larger than 256 bytes might not be supported by all regcomp
> > implementations.
>
> Yeah, musl fails.
>
> I don't actually know _what_ is slowing this down, but I'd guess "lack of
> cache locality" would be a significant part? Potentially a loop over the
> string with each pattern having "start of line" prepended to it might get
> some of the speed back? (Or again, it might be slower.)
>
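
a rough, untested sketch of that "anchor each pattern and walk the line
yourself" idea: compile "^" plus the pattern so regexec() only has to test
the current offset, and let the caller do the scanning. the hypothetical
any_match() below assumes patterns with no top-level | and no
backreferences, since "^a|b" only anchors the a branch:

  #include <regex.h>

  // anchored[] holds the patterns compiled with a leading "^".
  int any_match(regex_t *anchored, int n, char *line)
  {
    char *s;
    int i;

    for (s = line; *s; s++)
      for (i = 0; i < n; i++)
        // No REG_NOTBOL: the leading ^ is what pins the match to s.
        if (!regexec(anchored+i, s, 0, 0, 0)) return 1;

    return 0;
  }
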
> I could also sort the patterns by first character and loop through
> calling the appropriate regexes based on character at string position...
> which is harder than you think because there could be a | later in the
> pattern so that optimization only applies to a _subset_ of patterns.
> (First char being . or * or ( kinda throws that off too, and that
> optimization would also have to be aware of case insensitivity...)
>
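
and a bucketed version of the same sketch for the "dispatch on first
character" idea, with the awkward cases (a top-level | anywhere, a leading
. * ( [ \ etc, or case-insensitive patterns) dumped into an always-checked
bucket. again untested, and the names are made up:

  #include <regex.h>

  struct bucket { regex_t **regs; int count; };

  // by_first[c] holds anchored patterns whose first byte is the literal c;
  // "always" holds everything we can't dispatch on (leading metacharacter,
  // top-level alternation, REG_ICASE patterns not duplicated into both
  // case buckets, ...).
  int any_match_bucketed(struct bucket *by_first, struct bucket *always,
                         char *line)
  {
    char *s;
    int i;

    for (s = line; *s; s++) {
      struct bucket *b = by_first + (unsigned char)*s;

      for (i = 0; i < b->count; i++)
        if (!regexec(b->regs[i], s, 0, 0, 0)) return 1;
      for (i = 0; i < always->count; i++)
        if (!regexec(always->regs[i], s, 0, 0, 0)) return 1;
    }

    return 0;
  }
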
> There are things that can be done to speed this up, but we'd have to try
> and see what actually helped...
>
> > Thoughts on how risky this change could be? I could start sending you the
> > patches if you think the pros outweigh the potential cons?
>
> Gnu grep has its own bespoke regex implementation internally, because gnu.
> While I _have_ historically written my own regex implementation from
> scratch (using glob syntax rather than regex but same general idea), I
> chose not to do so here because project philosophy. This has downsides.
> There are things that can be done to speed it up, but I'd prefer a more
> real world test case to optimize against if we go down that path? (This
> is why I usually wait until somebody complains before optimizing or
> elaborating too much: they bring test cases. :)
>

note that this is the second time we've hit this problem. the previous
googler to mention it was on a different team, working on something
entirely different, but using grep in a similar way. both cases boil down
to "looking for one of a large number of files in a very large build log",
though.

yochiang's example here only looks artificial because it deliberately is
--- it's a way to reproduce without having to pay to do massive Android
builds.


> (Also, gnu's bespoke regex is working on blocks of data that haven't been
> delineated into lines, again for speed reasons. That's less of an issue
> here, but I did read a paper on it when I was writing my own grep...)
>
> > Yi-yo
>
> Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net
>