[Toybox] Unicode string comparison.

enh enh at google.com
Thu Sep 22 08:48:20 PDT 2022


On Thu, Sep 22, 2022 at 2:43 AM Rob Landley <rob at landley.net> wrote:

> If you have the same set of combining characters in a different order, is
> the result still considered the same character for string matching purposes?
>

"depends". there are multiple normalization forms.
https://unicode.org/reports/tr15/#Norm_Forms includes examples.

there _is_ a form where the combining characters are essentially sorted and
the group is compared in that order rather than the order they appeared in
the original input.
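
a minimal sketch of that reordering, using python's standard unicodedata module as a stand-in (not anything toybox ships). the helper name canonically_equal is made up for illustration. NFD sorts combining marks by their canonical combining class, so marks from different classes compare equal in either order, but two marks in the _same_ class keep their input order and stay distinct:

```python
import unicodedata

# 'q' + dot below (U+0323, combining class 220) + dot above (U+0307, class 230)
a = "q\u0323\u0307"
# the same two marks in the opposite order
b = "q\u0307\u0323"

def canonically_equal(x, y):
    """Compare two strings after canonical decomposition (NFD).

    Hypothetical helper: NFD applies canonical ordering, which sorts
    runs of combining marks by combining class.
    """
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

# Code point by code point the inputs differ...
print(a == b)
# ...but they are canonically equivalent once normalized.
print(canonically_equal(a, b))

# Acute (U+0301) and diaeresis (U+0308) share class 230, so their relative
# order is significant and normalization does not swap them.
print(canonically_equal("a\u0301\u0308", "a\u0308\u0301"))
```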

afaik unix command line stuff has never dealt with normalization though?
(sounds more unixy to have a separate normalization filter that lets you
choose which kind you want. but i'm not aware that that exists either. like
with tr -- no, i won't start that argument again -- i suspect these tools
aren't the droids you're looking for if you're actually doing serious
linguistic stuff.)
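
for what such a filter might look like, here's a rough sketch in python. the names normalize_lines and main are hypothetical, and this is an illustration of the idea rather than an existing tool. it assumes line-at-a-time processing is fine, which holds because a newline never combines with anything:

```python
import sys
import unicodedata

# The four normalization forms defined by UAX #15.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_lines(lines, form="NFC"):
    """Yield each input line converted to the requested normalization form."""
    if form not in FORMS:
        raise ValueError("unknown normalization form: " + form)
    for line in lines:
        yield unicodedata.normalize(form, line)

def main(argv, stdin, stdout):
    """Hypothetical entry point: filter stdin to stdout, form given as argv[1]."""
    form = argv[1] if len(argv) > 1 else "NFC"
    stdout.writelines(normalize_lines(stdin, form))

# To use it as a filter, call: main(sys.argv, sys.stdin, sys.stdout)
```

you'd then put it in front of grep or sort in a pipeline, picking whichever form the downstream comparison should see.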

> Does a regex . wildcard eat a unicode character _and_ trailing combining
> characters, or do you need a separate . for each code point whether or not it
> displays? (Do you wind up just ignoring the combining characters and matching
> only the characters with a width, or do you just match each unicode point which
> must occur in sequence? I'm assuming none of the combining characters are
> changed via towupper()?)
>

https://unicode.org/reports/tr18/#Introduction --- aiui level 1 says "no"
[`.` matches a non-normalized code point], but level 2 says "yes" [`.` uses
normalization as mentioned above].

i think for command-line tools, you're just aiming for level 1.
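
to make the "one `.` per code point" behavior concrete, here's a small sketch using python's re module, which treats `.` as a single code point. that's an assumption about python's engine only; it says nothing about what any particular libc regcomp() does. dot_matches_whole is a made-up helper name:

```python
import re
import unicodedata

# 'e' followed by a combining acute accent: two code points, one visible glyph.
decomposed = "e\u0301"
# The precomposed equivalent (U+00E9): a single code point.
composed = unicodedata.normalize("NFC", decomposed)

def dot_matches_whole(s):
    """Return True if a single '.' matches the entire string."""
    return re.fullmatch(".", s) is not None

# One '.' covers the precomposed form but not the decomposed one.
print(dot_matches_whole(composed))
print(dot_matches_whole(decomposed))

# The decomposed form needs one '.' per code point.
print(re.fullmatch("..", decomposed) is not None)
```

so under this behavior the same visible text can need a different number of dots depending on how it was encoded, unless the input is normalized first.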

"For most full-featured regular expression engines, it is quite difficult
to match under canonical equivalence, which may involve reordering,
splitting, or merging of characters."


> Rob
>
> P.S. For the moment in my attempts to speed up grep I'm just treating "has a
> byte > 127 in it" as "feed it to the regex engine and let REG_ICASE deal with
> it". That's not what the BSD one did, but in the absence of use cases where that
> comes up that I need to accelerate. My only use case being hex digits
> theoretically means I could have used a hash bucket size of 16, but I'm assuming
> that's not what real test data is doing...