[Toybox] Unicode string comparison.

Fri Sep 23 01:36:40 PDT 2022

On 9/22/22 10:48, enh wrote:
> On Thu, Sep 22, 2022 at 2:43 AM Rob Landley <rob at landley.net> wrote:
> 
>     If you have the same set of combining characters in a different order, is
>     the result still considered the same character for string matching purposes?
> 
> "depends". there are multiple normalization forms.

Oh joy.

> "For most full-featured regular expression engines, it is quite difficult to
> match under canonical equivalence, which may involve reordering, splitting, or
> merging of characters."

I've gone back to just punting Unicode to regcomp() and friends: if you stick a
character above 127 in your pattern, it doesn't take the fast path I'm
implementing. (Not that I expect the regex engine to do better, but then it's
not _my_ fault quite so much.)
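
A minimal sketch of that check, assuming the decision is made per pattern by
scanning its bytes. The helper name pattern_is_ascii() is hypothetical and is
not taken from the toybox source:

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper (the name is an assumption, not toybox code).
// Returns true when every byte of the pattern is 7-bit ASCII, so a
// byte-oriented fast path could handle it. Any byte above 127 means
// the caller should hand the pattern to regcomp()/regexec() instead.
static bool pattern_is_ascii(const char *pat)
{
  for (; *pat; pat++)
    if ((unsigned char)*pat > 127) return false;

  return true;
}
```

A caller would try the fast path only when pattern_is_ascii(pattern) is true,
and call regcomp() otherwise. The cast to unsigned char matters because plain
char is signed on many platforms, so UTF-8 bytes would compare as negative.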

But I'm trying to understand regex escapes, and...

$ echo 'a[c' | grep 'a\[c'
a[c
$ echo 'a\bc' | grep 'a\bc'
$ echo abc | grep 'a\bc'
$ echo ac | grep 'a\bc'
$ echo 'a^c' | grep 'a\^c'
a^c
$ echo 'a^c' | grep 'a^c'
a^c
$ echo 'a\b' | grep 'a\b'
a\b
$ echo 'a\b' | grep 'a\b.'
a\b
$ echo 'a\b' | grep 'a\b..'
a\b
$ echo 'a\b' | grep 'a\b...'
$

I do not understand regex escapes. (This is all with the debian grep.)
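
One reading that fits the transcript, assuming the engine treats \b as a GNU
word-boundary assertion (an extension, not something POSIX defines for basic
regular expressions) rather than as a literal backslash followed by b. Under
that reading 'a\bc' can never match, because no boundary exists between two
adjacent word characters, while 'a\b..' matches the line a\b as "a, boundary,
any two characters". The sketch below replays the cases through regcomp(); the
helper name bre_matches() is hypothetical, and the results depend on whether
the libc implements the extension:

```c
#include <regex.h>
#include <stddef.h>

// Hypothetical helper: returns 1 if the basic regex `pat` matches anywhere
// in `text`, 0 if it does not, and -1 if the pattern fails to compile.
// This is roughly the per-line question grep asks.
static int bre_matches(const char *pat, const char *text)
{
  regex_t re;
  int rc;

  // No REG_EXTENDED, so the pattern is compiled as a basic regex like
  // grep's default. REG_NOSUB: only match/no-match is needed.
  if (regcomp(&re, pat, REG_NOSUB)) return -1;
  rc = regexec(&re, text, 0, NULL, 0);
  regfree(&re);

  return rc == 0;
}
```

On a libc with the word-boundary extension, bre_matches("a\\b..", "a\\b")
should return 1 and bre_matches("a\\b...", "a\\b") should return 0, which
lines up with the last two grep commands above.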

Rob