[Toybox] Slow grep

Rob Landley rob at landley.net
Thu Sep 15 13:54:07 PDT 2022


On 9/15/22 07:30, Yi-yo Chiang via Toybox wrote:
> grep is slow when the number of patterns is large.
...
>   xxd -p -c 0 -l 40 /dev/urandom

Huh, WHY does that produce two lines of output?

$ xxd -p -c 0 -l 40 /dev/urandom
1fcf13e1b4844ba209fb9958bde26a13577c577744f1b1290240d03f4f8e
644fd0687c39b1aa8a68

Behavior is consistent between toybox xxd and debian's, but Elliott sent in the
xxd implementation and I don't use it much, so... Feeding in -c 40 and -c 80
makes no difference?

Huh. Oh well, tangent. I'll assume the test you sent me is the test you meant to
send me.

> A simple optimization would be to concat all patterns into one huge regex
> pattern with '|'.

Which you could do in your pattern creation just as easily as grep could do so
internally? Except when I tried it:

$ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
  test_large_10000

The result was even _slower_ on glibc (and gave me "out of memory" on musl).

> I've hacked something together locally to do this, and the
> improvement is evident (runtime reduced to ~2s).

That was my first implementation, but it had portability issues (off the top of
my head: musl's non-extended regex syntax didn't support |, the \1 backreferences
don't reset at | either, plus I think there were circumstances in which ^ and $
could get confused? Plus the empty regex line match thing...)

So I replaced it in commit ca3528d7bf97. (Alas libc hasn't got a lex primitive
the way it's got regexec.)

> But this also had some potential problems:
> 1. Does regexec always return the left-most longest match?

Posix says it does:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#:~:text=longest%20of%20the

> 2. One large pattern might be problematic. `man 7 regex` threatens that
> patterns larger than 256 bytes might not be supported by all regcomp
> implementations.

Yeah, musl fails.

I don't actually know _what_ is slowing this down, but I'd guess "lack of cache
locality" would be a significant part? Potentially a loop over the string with
each pattern having "start of line" prepended to it might get some of the speed
back? (Or again, it might be slower.)

I could also sort the patterns by first character and loop through calling the
appropriate regexes based on character at string position... which is harder
than you think because there could be a | later in the pattern so that
optimization only applies to a _subset_ of patterns. (First char being . or * or
( kinda throws that off too, and that optimization would also have to be aware
of case insensitivity...)

There are things that can be done to speed this up, but we'd have to try and see
what actually helped...

> Thoughts on how risky this change could be? I could start sending you the
> patches if you think the pros outweigh the potential cons?

Gnu grep has its own bespoke regex implementation internally, because gnu. While
I _have_ historically written my own regex implementation from scratch (using
glob syntax rather than regex but same general idea), I chose not to do so here
because project philosophy. This has downsides. There are things that can be
done to speed it up, but I'd prefer a more real world test case to optimize
against if we go down that path? (This is why I usually wait until somebody
complains before optimizing or elaborating too much: they bring test cases. :)

(Also, gnu's bespoke regex is working on blocks of data that haven't been
delineated into lines, again for speed reasons. That's less of an issue here,
but I did read a paper on it when I was writing my own grep...)

> Yi-yo

Rob

