[Toybox] Slow grep

Yi-yo Chiang yochiang at google.com
Thu Sep 15 05:30:00 PDT 2022


grep is slow when the number of patterns is large.

We have some tests that use `grep -v -f $file1 $file2` to check if $file1
is a superset of $file2, and if not, then generate a report of what is in
$file2 but not $file1.
The command completes reasonably fast on desktop linux (grep (GNU grep)
3.7), but very slow on CI. Digging further shows that CI uses toybox grep.

For who's interested, this is the test script (
https://cs.android.com/android/platform/superproject/+/master:development/gki/kmi_abi_chk/kmi_compatibility_test.sh
).
Or just use this command to reproduce:
```
for i in {1..10000}; do
  xxd -p -c 0 -l 40 /dev/urandom
done >test_large_10000
time toybox grep -f test_large_10000 test_large_10000
```
(This runs for ~50s on my linux desktop)

The issue is that for each pattern (-e arg and _lines_ in -f arg), toybox
regcomp a standalone regex object. For each input line, the entire list of
regex objects are looped over to find the longest match [1], and looped
multiple times to find all matches [2].
This means regexec would be called O(L x N) times, where L is number of
patterns, N is number of input lines.
[1]:
https://cs.android.com/android/platform/superproject/+/master:external/toybox/toys/posix/grep.c;l=212
[2]:
https://cs.android.com/android/platform/superproject/+/master:external/toybox/toys/posix/grep.c;l=167

A simple optimization would be to concat all patterns into one huge regex
pattern with '|'. I've hacked something together locally to do this, and
the improvement is evident (runtime reduced to ~2s).
But this also had some potential problems:
1. Does regexec always return the left-most longest match? `man 7 regex`
says yes, but `man regexec` says nothing about this guarantee. I think it
does guarantee "left-most" match, but it might not always return the
longest match if multiple alternative patterns (patterns joined with |) are
provided. I'm afraid this might be implementation dependent and some
regcomp/regexec impl could return the "first matched alternative pattern"
instead.
2. One large pattern might be problematic. `man 7 regex` threatens that
patterns larger than 256 bytes might not be supported by all regcomp
implementations.

Thoughts on how risky this change could be? I could start sending you the
patches if you think the pros outweigh the potential cons?

Yi-yo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220915/3065509d/attachment.htm>


More information about the Toybox mailing list