<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 22, 2022 at 2:43 AM Rob Landley &lt;<a href="mailto:rob@landley.net">rob@landley.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">If you have the same set of combining characters in a different order, is<br>

the result still considered the same character for string matching purposes?<br></blockquote><div><br></div><div>"depends". there are multiple normalization forms. <a href="https://unicode.org/reports/tr15/#Norm_Forms">https://unicode.org/reports/tr15/#Norm_Forms</a> includes examples.</div><div><br></div><div>there _is_ a canonical ordering: normalization sorts each run of combining characters by combining class (marks in the same class keep their input order, since swapping those can change the result), and the group is compared in that order rather than the order they appeared in the original input.</div><div><br></div><div>afaik unix command line stuff has never dealt with normalization though? (sounds more unixy to have a separate normalization filter that lets you choose which kind you want. but i'm not aware that that exists either. like with tr -- no, i won't start that argument again -- i suspect these tools aren't the droids you're looking for if you're actually doing serious linguistic stuff.)</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Does a regex . wildcard eat a unicode character _and_ trailing combining<br>

characters, or do you need a separate . for each code point whether or not it<br>

displays? (Do you wind up just ignoring the combining characters and matching<br>

only the characters with a width, or do you just match each unicode point which<br>

must occur in sequence? I'm assuming none of the combining characters are<br>

changed via towupper()?)<br></blockquote><div><br></div><div><a href="https://unicode.org/reports/tr18/#Introduction">https://unicode.org/reports/tr18/#Introduction</a> --- aiui level 1 says "no" [`.` matches a non-normalized code point], but level 2 says "yes" [`.` uses normalization as mentioned above].</div><div><br></div><div>i think for command-line tools, you're just aiming for level 1.<br></div><div><br></div><div>"For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters."</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Rob<br>

<br>

P.S. For the moment in my attempts to speed up grep I'm just treating "has a<br>

byte > 127 in it" as "feed it to the regex engine and let REG_ICASE deal with<br>

it". That's not what the BSD one did, but in the absence of use cases where that<br>

comes up that I need to accelerate. My only use case being hex digits<br>

theoretically means I could have used a hash bucket size of 16, but I'm assuming<br>

that's not what real test data is doing...<br>


</blockquote></div></div>
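<div dir="ltr"><div>a rough python sketch of the two points above (reordered combining marks, and what `.` consumes). it's not anything toybox does: it assumes python's stdlib `unicodedata` and `re` modules, and the helper names `canonically_equal` and `one_dot_matches` are made up for illustration. python's `re` works on code points, so it behaves like the level 1 case.</div>
<pre>
```python
import re
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after NFD, which applies canonical ordering."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def one_dot_matches(s):
    """True if a single '.' covers the whole string (one code point)."""
    return re.fullmatch(".", s) is not None

# q + dot below (U+0323, class 220) + dot above (U+0307, class 230),
# in both orders; q has no precomposed forms with these marks.
below_first = "q\u0323\u0307"
above_first = "q\u0307\u0323"
print(below_first == above_first)                   # False: raw code points differ
print(canonically_equal(below_first, above_first))  # True: different classes get sorted

# acute (U+0301) and diaeresis (U+0308) are both class 230, so their
# relative order is significant and normalization leaves it alone.
print(canonically_equal("e\u0301\u0308", "e\u0308\u0301"))  # False

# '.' in python's re matches one code point, not one displayed character.
decomposed = "e\u0301"                             # e + combining acute
print(one_dot_matches(decomposed))                 # False: two code points
print(re.fullmatch("..", decomposed) is not None)  # True
print(one_dot_matches(unicodedata.normalize("NFC", decomposed)))  # True: U+00E9
```
</pre>
<div>so with a code-point-based engine, a pattern and the text only line up if they happen to be in the same normalization form, unless something normalizes both sides first.</div></div>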