[Toybox] Slow grep

Tue Sep 27 01:12:22 PDT 2022

On 9/27/22 01:47, Yi-yo Chiang wrote:
> (I suffered a system failure during the weekend and lost all my browser tabs and
> consequently lost track of all my work...)
> 
> Might be a stupid question, but what's so special about ". in first position"?

Because the new fast path iterates over each character in the input string, and
at each position checks the patterns that start with that character:

  for (ss = start; ss-line<ulen; ss++) {
    ii = FLAG(i) ? toupper(*ss) : *ss;
    for (seek = TT.fixed[ii]; seek; seek = seek->next) {

If I allow "." in first position then I need to repeat the inner loop starting
with seek = TT.fixed['.'] at each position, which is doable but awkward enough I
thought I'd wait for somebody to complain. :)

> Doesn't ANY '.' in ANY position (sans <backslash><dot>) indicate a regex (not
> fixed pattern)?

The new fixed pattern search logic can (when FLAG(F) isn't set) handle some
simple regex logic: the "." single character wildcard character, ^ at start and
$ at end, and \ escapes that make regex chars literal. (This is a subset of what
the openbsd one was doing. They also did unicode stuff which I punted on: any
byte > 127 puts the pattern into the slow path that falls back to the old
regcomp and have regexec scan each string from start to finish for each pattern.)

The new fast path logic has two main chunks, in parse_regex() after the comment:

// Convert to regex where appropriate

It works out which patterns the fast path can handle. (Including some magic
tests like if what you've \ escaped is a ( and you're not in -E extended regex
mode, then the escape is actually ACTIVATING it not deactivating it. Regex
escaping is generally weird, as I believe I complained here. The fast path logic
only handles a subset of it, and punts to the rest to the slow path.)

Once it's identified which patterns can be peeled out into the fast path, it
then bucket sorts them by first character into a 256 entry array. (Since I punt
anything with UTF-8 in it to the slow path the array would only need to be 128
bytes long... except -F can make ANY string take the fast path because it forces
literal match with no special behavior of any character. So I need the top half
of the array for that.)

Then the logic that actually USES the -F patterns is in do_grep() after the comment:

// Handle "fixed" (literal) matches (if any)

Which includes those two loops I pasted above.

I can complicate the fast path a bit more if additional use cases show up, but
in the absence of people giving me real world data... I implemented more than
just the new test case, but less than openbsd was doing.

Rob

P.S. if you have a file of filenames you probably want both -F and -f because
otherwise "potato.txt" will match "potatootxt" because regex. But presumably
that was true of the gnu/dammit and openbsd vesions too...