[Toybox] Slow grep

Yi-yo Chiang yochiang at google.com
Mon Oct 3 01:43:26 PDT 2022


On Thu, Sep 29, 2022 at 1:20 PM Rob Landley <rob at landley.net> wrote:

> On 9/28/22 14:54, enh wrote:
> > heh... funny you should mention "." in the first position...
> >
> > $ echo "foo.jar" | old-toybox grep '\.jar'
> > foo.jar
> > $ echo "foo.jar" | new-toybox grep '\.jar'
> > $
>
> Technically that's '\' in the first position, which is the problem.
>
> (That's both parsing wrong, and I had an optimization where it skips the
> first
> character checking fast pattern matches because obviously we already
> matched
> that to find the right bucket with the patterns for this character...
> EXCEPT if
> that first character is escaped then skipping one character of each side
> of the
> comparison will put the pattern traversal out of sync.)
>
> Meanwhile, a pattern '^$' should match ONLY empty lines, and the easiest
> way to
> handle that is just special case that one to the regex path. The fast path
> can
> still handle '^' and '$' which match every line: zero length pattern does
> a zero
> length match at each position, which -o has plumbing to filter out because
> *
> does that too since it's zero-or-more instances so a* is going to match for
> length zero at every place that does NOT have an 'a', ala:
>
>   $ echo '' | grep 'z*' | wc -l
>   1
>
> All '^' and '$' do is say the zero length match occurs just once at the
> start or
> end of every line, but the -o logic discards all zero length matches and
> the
> non-o logic just cares that there IS a match. UNLESS you're producing
> colored
> output, but the colored output mostly piggybacks on -o. There is a sort of
> terrible corner case though, which oddly enough makes this corner case
> visible!
>
> $ echo potato | grep --color=always '' | hd
> 00000000  70 6f 74 61 74 6f 0a                              |potato.|
> 00000007
> $ echo potato | toybox grep --color=always '' | hd
> 00000000  1b 5b 6d 1b 5b 31 3b 33  31 6d 1b 5b 6d 70 6f 74
> |.[m.[1;31m.[mpot|
> 00000010  61 74 6f 0a                                       |ato.|
> 00000014
> $ echo potato | toybox grep --color=always '$' | hd
> 00000000  1b 5b 6d 70 6f 74 61 74  6f 1b 5b 31 3b 33 31 6d
> |.[mpotato.[1;31m|
> 00000010  1b 5b 6d 0a                                       |.[m.|
> 00000014
>
> Yeah, I have a todo item to try to optimize the color escape generation for
> toybox but last time I sat down and looked at it I just didn't have the
> spoons.
>

I have some optimizations related to color output, not sure if these are
what you had in mind?
* If not showing color (no --color) or not showing matched part (-v) and
(-o, -w, -x) are not given, then call regcomp() with REG_NOSUB. (could
potentially improve the compiled regex)
* If regex pattern was compiled with REG_NOSUB (we are not
printing/highlighting matched part), then we can stop pattern matching as
soon as we found first hit (no need to loop through all patterns to find
longest match nor all matches) (break this loop early
https://github.com/landley/toybox/blob/b17dc8e111dd408db04ea7ae70c410fd3054a751/toys/posix/grep.c#L218
).

I can send you these patches after I rebase my local tree on top of the
recent bucket sort optimizations.


>
> > that a simplified case based on the real-world build break caused by
> ...
> > merging zero jar files into one instead of all the jar files :-)
>
> Acknowledged. Try commit 193009855266?
>

Still had problems when matching Fixed patterns:

$ echo '\.zip' | ./old-toybox grep -F '\.zip'
\.zip
$ echo '\.zip' | ./toybox grep -F '\.zip'
$

I think what's happening is the fixed string pattern '\.zip' gets put into
the '.' bucket (should be '\' bucket instead?), thus '\.zip' never matches
anything.


>
> Also, I checked in the start of the sed/tar xform protocol stuff. It's not
> handling flags yet, but it's passing the data so it _can_ do so. (I didn't
> want
> to send you a tar/sed pair that had to be upgraded in tandem and then
> changed
> AGAIN later. Hopefully this is the usable protocol version...)
>
> Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20221003/494da88c/attachment.htm>


More information about the Toybox mailing list