[Toybox] Slow grep

enh enh at google.com
Fri Sep 30 08:38:42 PDT 2022


On Wed, Sep 28, 2022 at 10:20 PM Rob Landley <rob at landley.net> wrote:

> On 9/28/22 14:54, enh wrote:
> > heh... funny you should mention "." in the first position...
> >
> > $ echo "foo.jar" | old-toybox grep '\.jar'
> > foo.jar
> > $ echo "foo.jar" | new-toybox grep '\.jar'
> > $
>
> Technically that's '\' in the first position, which is the problem.
>
> (That's both parsing wrong, and I had an optimization where it skips the
> first
> character checking fast pattern matches because obviously we already
> matched
> that to find the right bucket with the patterns for this character...
> EXCEPT if
> that first character is escaped then skipping one character of each side
> of the
> comparison will put the pattern traversal out of sync.)
>
> Meanwhile, a pattern '^$' should match ONLY empty lines, and the easiest
> way to
> handle that is just special case that one to the regex path. The fast path
> can
> still handle '^' and '$' which match every line: zero length pattern does
> a zero
> length match at each position, which -o has plumbing to filter out because
> *
> does that too since it's zero-or-more instances so a* is going to match for
> length zero at every place that does NOT have an 'a', ala:
>
>   $ echo '' | grep 'z*' | wc -l
>   1
>
> All '^' and '$' do is say the zero length match occurs just once at the
> start or
> end of every line, but the -o logic discards all zero length matches and
> the
> non-o logic just cares that there IS a match. UNLESS you're producing
> colored
> output, but the colored output mostly piggybacks on -o. There is a sort of
> terrible corner case though, which oddly enough makes this corner case
> visible!
>
> $ echo potato | grep --color=always '' | hd
> 00000000  70 6f 74 61 74 6f 0a                              |potato.|
> 00000007
> $ echo potato | toybox grep --color=always '' | hd
> 00000000  1b 5b 6d 1b 5b 31 3b 33  31 6d 1b 5b 6d 70 6f 74
> |.[m.[1;31m.[mpot|
> 00000010  61 74 6f 0a                                       |ato.|
> 00000014
> $ echo potato | toybox grep --color=always '$' | hd
> 00000000  1b 5b 6d 70 6f 74 61 74  6f 1b 5b 31 3b 33 31 6d
> |.[mpotato.[1;31m|
> 00000010  1b 5b 6d 0a                                       |.[m.|
> 00000014
>
> Yeah, I have a todo item to try to optimize the color escape generation for
> toybox but last time I sat down and looked at it I just didn't have the
> spoons.
>
> > that a simplified case based on the real-world build break caused by
> ...
> > merging zero jar files into one instead of all the jar files :-)
>
> Acknowledged. Try commit 193009855266?
>

the updated build prebuilts pass all the presubmit tests at least :-)

i'll refrain from switching everyone over on a friday (that's already a
saturday for some people) though, and will do that on (my) monday morning
instead...


> Also, I checked in the start of the sed/tar xform protocol stuff. It's not
> handling flags yet, but it's passing the data so it _can_ do so. (I didn't
> want
> to send you a tar/sed pair that had to be upgraded in tandem and then
> changed
> AGAIN later. Hopefully this is the usable protocol version...)
>

i'll wait for an "all clear" before i poke further :-)


> Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220930/75b4037a/attachment.htm>


More information about the Toybox mailing list