[Toybox] Unicode string comparison.
Rob Landley
rob at landley.net
Fri Sep 23 01:36:40 PDT 2022
On 9/22/22 10:48, enh wrote:
> On Thu, Sep 22, 2022 at 2:43 AM Rob Landley <rob at landley.net> wrote:
>
> If you have the same set of combining characters in a different order, is
> the result still considered the same character for string matching purposes?
>
> "depends". there are multiple normalization forms.
Oh joy.
> "For most full-featured regular expression engines, it is quite difficult to
> match under canonical equivalence, which may involve reordering, splitting, or
> merging of characters."
I've gone back to just punting unicode to regcomp() and friends: you stick a
character above 127 in your pattern and it's not taking the fast path I'm
implementing. (Not that I expect the regex engine to do better, but then it's
not _my_ fault quite so much.)
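A minimal sketch of that check, assuming a hypothetical helper name (has_high_bytes() is made up for illustration, not the actual toybox code): any byte above 127 in the pattern means it skips the fast path and goes to regcomp() instead.

```c
// Hypothetical helper, not from toybox: returns 1 if the pattern contains
// any byte above 127 (for UTF-8 input, part of a multibyte sequence),
// meaning the caller should skip the fast path and use regcomp() instead.
int has_high_bytes(const char *pattern)
{
  for (; *pattern; pattern++)
    if ((unsigned char)*pattern > 127) return 1;

  return 0;
}
```

The cast to unsigned char matters because plain char may be signed, in which case bytes above 127 compare as negative.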
But I'm trying to understand regex escapes, and...
$ echo 'a[c' | grep 'a\[c'
a[c
$ echo 'a\bc' | grep 'a\bc'
$ echo abc | grep 'a\bc'
$ echo ac | grep 'a\bc'
$ echo 'a^c' | grep 'a\^c'
a^c
$ echo 'a^c' | grep 'a^c'
a^c
$ echo 'a\b' | grep 'a\b'
a\b
$ echo 'a\b' | grep 'a\b.'
a\b
$ echo 'a\b' | grep 'a\b..'
a\b
$ echo 'a\b' | grep 'a\b...'
$
I do not understand regex escapes. (This is all with the debian grep.)
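One reading that fits the \b results above, offered as a guess rather than a definitive answer: GNU grep documents \b as a zero-width "edge of a word" assertion (a GNU extension), not an escaped literal b. A rough sketch of that check in C, with a hypothetical helper name and assuming word characters are just ASCII letters, digits, and underscore:

```c
#include <ctype.h>
#include <string.h>

// Assumption: word characters are ASCII letters, digits and underscore.
// Real implementations are locale-aware.
static int is_word(char c)
{
  return c == '_' || isalnum((unsigned char)c);
}

// Hypothetical helper: returns 1 if there is a word boundary just before
// s[pos], meaning exactly one of the two neighbouring characters is a word
// character. Start and end of string count as non-word neighbours.
int at_word_boundary(const char *s, size_t pos)
{
  int before = pos > 0 && is_word(s[pos-1]);
  int after = pos < strlen(s) && is_word(s[pos]);

  return before != after;
}
```

Under that reading, in the input a\b the spot between "a" and the backslash is a boundary, so 'a\b' matches, and so does the pattern with one or two trailing dots (they eat the backslash and the b). Three dots need more characters than the line has. In abc and ac there is no boundary after the "a", and in a\bc the character after the boundary is a backslash rather than "c", so those three print nothing.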
Rob