[Toybox] Unicode string comparison.

Rob Landley rob at landley.net
Thu Sep 22 02:52:00 PDT 2022


If you have the same set of combining characters in a different order, is
the result still considered the same character for string matching purposes?
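
For reference, Unicode's answer is "sometimes": canonical equivalence reorders
combining marks by combining class, so marks of different classes may appear in
either order, while marks of the same class may not. A minimal sketch in Python
rather than toybox's C, since Python ships a normalizer in unicodedata. The
helper name canonically_equal is made up for illustration, and whether a given
regex engine normalizes before matching is a separate question this does not
answer.

```python
import unicodedata

def canonically_equal(x: str, y: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

# Dot below (U+0323, class 220) and dot above (U+0307, class 230):
# different combining classes, so either order is canonically equivalent.
a = "q\u0323\u0307"
b = "q\u0307\u0323"
print(a == b, canonically_equal(a, b))

# Acute (U+0301) and diaeresis (U+0308) are both class 230:
# same class, so the order is significant and they stay distinct.
c = "e\u0301\u0308"
d = "e\u0308\u0301"
print(c == d, canonically_equal(c, d))
```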

Does a regex . wildcard eat a Unicode character _and_ its trailing combining
characters, or do you need a separate . for each code point whether or not it
displays? (Do you wind up ignoring the combining characters and matching only
the characters that have a width, or do you match each code point, which must
occur in sequence? I'm assuming none of the combining characters are changed
by towupper()?)
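
As one data point only, here is how Python's re module and Unicode tables
behave. This is not a statement about what regcomp() or towupper() do in any
particular libc. In Python, . consumes exactly one code point, so a base
letter plus one combining mark needs two of them. The second half scans for
combining marks that case conversion alters, which is a way to test the
towupper() assumption against Unicode data. The helper name
case_changing_marks and the 0x3000 scan limit are arbitrary choices for this
sketch.

```python
import re
import unicodedata

cluster = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, displayed as one character

one_dot = re.fullmatch(".", cluster)    # None: "." consumes a single code point
two_dots = re.fullmatch("..", cluster)  # matches: one "." per code point

def case_changing_marks(limit=0x3000):
    """Code points below limit with a nonzero combining class that upper() changes."""
    return [hex(cp) for cp in range(limit)
            if unicodedata.combining(chr(cp)) and chr(cp).upper() != chr(cp)]

print(one_dot, two_dots)
print(case_changing_marks())
```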

Rob

P.S. For the moment, in my attempts to speed up grep, I'm just treating "has a
byte > 127 in it" as "feed it to the regex engine and let REG_ICASE deal with
it". That's not what the BSD one did, but I don't have a use case where that
comes up that I need to accelerate. My only use case being hex digits
theoretically means I could have used a hash bucket size of 16, but I'm assuming
that's not what real test data is doing...
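
The "byte > 127" check amounts to something like the following sketch (in
Python for brevity, not the actual toybox code). The function name
needs_regex_engine is hypothetical.

```python
def needs_regex_engine(line: bytes) -> bool:
    """True if the line has any byte > 127, so it takes the slow regex path."""
    return any(byte > 127 for byte in line)

print(needs_regex_engine(b"deadbeef"))             # pure ASCII: fast path
print(needs_regex_engine("caf\u00e9".encode("utf-8")))  # multibyte UTF-8: regex path
```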

