[Toybox] Slow grep

Wed Sep 28 12:54:06 PDT 2022

heh... funny you should mention "." in the first position...

$ echo "foo.jar" | old-toybox grep '\.jar'
foo.jar
$ echo "foo.jar" | new-toybox grep '\.jar'
$

that a simplified case based on the real-world build break caused by
https://cs.android.com/android/platform/superproject/+/master:prebuilts/cmdline-tools/Android.bp;l=94;drc=6fa24e3c9dce05b5440b68997b58df791dddd617
merging zero jar files into one instead of all the jar files :-)

On Tue, Sep 27, 2022 at 1:03 AM Rob Landley <rob at landley.net> wrote:

> On 9/27/22 01:47, Yi-yo Chiang wrote:
> > (I suffered a system failure during the weekend and lost all my browser
> tabs and
> > consequently lost track of all my work...)
> >
> > Might be a stupid question, but what's so special about ". in first
> position"?
>
> Because the new fast path iterates over each character in the input
> string, and
> at each position checks the patterns that start with that character:
>
>   for (ss = start; ss-line<ulen; ss++) {
>     ii = FLAG(i) ? toupper(*ss) : *ss;
>     for (seek = TT.fixed[ii]; seek; seek = seek->next) {
>
> If I allow "." in first position then I need to repeat the inner loop
> starting
> with seek = TT.fixed['.'] at each position, which is doable but awkward
> enough I
> thought I'd wait for somebody to complain. :)
>
> > Doesn't ANY '.' in ANY position (sans <backslash><dot>) indicate a regex
> (not
> > fixed pattern)?
>
> The new fixed pattern search logic can (when FLAG(F) isn't set) handle some
> simple regex logic: the "." single character wildcard character, ^ at
> start and
> $ at end, and \ escapes that make regex chars literal. (This is a subset
> of what
> the openbsd one was doing. They also did unicode stuff which I punted on:
> any
> byte > 127 puts the pattern into the slow path that falls back to the old
> regcomp and have regexec scan each string from start to finish for each
> pattern.)
>
> The new fast path logic has two main chunks, in parse_regex() after the
> comment:
>
> // Convert to regex where appropriate
>
> It works out which patterns the fast path can handle. (Including some magic
> tests like if what you've \ escaped is a ( and you're not in -E extended
> regex
> mode, then the escape is actually ACTIVATING it not deactivating it. Regex
> escaping is generally weird, as I believe I complained here. The fast path
> logic
> only handles a subset of it, and punts to the rest to the slow path.)
>
> Once it's identified which patterns can be peeled out into the fast path,
> it
> then bucket sorts them by first character into a 256 entry array. (Since I
> punt
> anything with UTF-8 in it to the slow path the array would only need to be
> 128
> bytes long... except -F can make ANY string take the fast path because it
> forces
> literal match with no special behavior of any character. So I need the top
> half
> of the array for that.)
>
> Then the logic that actually USES the -F patterns is in do_grep() after
> the comment:
>
> // Handle "fixed" (literal) matches (if any)
>
> Which includes those two loops I pasted above.
>
> I can complicate the fast path a bit more if additional use cases show up,
> but
> in the absence of people giving me real world data... I implemented more
> than
> just the new test case, but less than openbsd was doing.
>
> Rob
>
> P.S. if you have a file of filenames you probably want both -F and -f
> because
> otherwise "potato.txt" will match "potatootxt" because regex. But
> presumably
> that was true of the gnu/dammit and openbsd vesions too...
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220928/0551f90b/attachment-0001.htm>