[Toybox] incorrect output from grep -w --color

Mon Jan 22 19:09:42 PST 2024

On 1/17/24 18:57, Rob Landley wrote:
> On 1/17/24 15:16, Peter Collingbourne via Toybox wrote:
> Thanks. Digging into it...

Looks like I screwed up --color when adding the bucket sort logic. (I haven't
got regression tests for --color, because unless I'm exactly matching the ascii
escape sequences I'd just be filtering them out in the test...)

The --color logic is a repurposed variant of the -o logic, and what's happening
here is the regex-alikes are matching "a", then vetoing it because of the extra
-w stipulations, then doing the right thing for -o (not outputting it) instead
of the right thing for --color (extending the not-match output).

The logic found the best match and then did a FLAG(w) check afterwards to see if
it's a WORD match, and the problem is it skipped it if so, which advanced past
it, which is why it didn't get printed. Used to work with the old logic, but
inappropriate now with the new bucket sort stuff.

The fix is to move the FLAG(w) check into the two loops (fixed and regex
matches) so it vetoes each match immediately, and we never advance to the "skip
forward past this" logic for something that isn't a match. I made a function
that both can call so the code isn't TOO badly duplicated.

The hangup is that my first stab at fixing this broke the two -w '' tests, and
then I got confused staring at:

      // Empty pattern always matches
      if (rc && *TT.fixed && !FLAG(o)) rc = 0;

Because it LOOKS backwards: if rc is set and the first character of fixed is set
switch the match _off_? But rc is using syscall logic return code logic here,
where 0 means success and 1 means failure (because that's what regexec() returns
so it's all doing that). And then I had to reread parse_regex() to remind myself
that TT.fixed is an array of arrays (bucket sort, indexed by first letter), and
that fixed[0] means the first (non-special) character is zero, so the above
check is asking "do we have any zero length fixed match strings". (I left myself
a comment but it wasn't a BIG ENOUGH comment).

And THEN it turns out I never quite understood the -w logic in the first place,
and it took me a while to come up with even more tests:

$ echo | grep 'a'
$ echo | grep ''

$ echo 'one' | grep -w ''
$ echo 'one two' | grep -w ''
$ echo 'one  two' | grep -w ''
one  two
$ echo 'one ' | grep -w ''
one
$ echo ' one' | grep -w ''
 one
$ echo potato | grep '^'
potato
$ echo potato | grep '$'
potato
$ echo 'one  two' | grep -w '^'
$ echo 'one  two' | grep -w '$'

So without -w the strings ^ and $ match any line (because every line has a
beginning and an end), but ^$ only matches an empty line.

With -w the empty string only matches either an empty line or a line with two
spaces in it, and ^ matches an empty string OR a line with a space as the first
character, and $ matches the empty string OR a line with a space as the last
character.

Which means that empty string checking snippet ALSO needs to move inside the
loop with the rest of the -w logic.

At first I thought it doesn't move inside the function because empty strings are
always fixed matches (if there are any unescaped special characters to trigger
regex plumbing, it's not a fixed match). But then I thought of:

  $ echo 'one two' | grep -w 'a*'
  $ echo 'one  two' | grep -w 'a*'
  one  two

Because zero or more occurrences of "a" is a zero length match and of course I
have to police it for -w with ^ and $.

So if you're wondering why that's taking so long to fix... it's because my test
suite didn't have enough tests and I had work out how to add more.

Rob