[Toybox] strlower() bug

enh enh at google.com
Thu May 30 14:12:43 PDT 2024


On Wed, May 29, 2024 at 5:50 AM Rob Landley <rob at landley.net> wrote:
>
> On 5/22/24 09:30, enh wrote:
> > On Tue, May 14, 2024 at 2:58 PM Rob Landley <rob at landley.net> wrote:
> >> It looks like macos towlower() refuses to return expanding unicode characters.
> >> Possibly to avoid exactly the kind of bug this fixed, in exchange for corrupting
> >> the data.
> >
> > yeah, i don't know whether it's on purpose or a bug, but that does
> > seem to be the case... i tested with another Latin Extended-B
> > character whose uppercase and lowercase forms are both in the same
> > block (and thus have the same utf8 encoding length), and macOS
> > towlower() does work for that.
> >
> > hmm, actually maybe it's just that their Unicode data is out of date?
> > it looks like they don't know about Latin Extended-C at all? a code
> > point like U+2c62 that gets _smaller_ (because its lowercase form,
> > U+026b, is in the IPA Extensions block) doesn't work either.
> >
> > i did try looking in FreeBSD, but i've never understood how this stuff
> > works there.
>
> FreeBSD questions go to Ed Maste <emaste at freebsd.org> who is theoretically
> subscribed here but keeps getting unsubscribed by gmail bounces.
>
> > i'm guessing from the fact i've never found them that the
> > implementations are all generated at build time, subtly enough that my
> > attempts to grep for the generators fail.
> >
> > hmm... looking at Apple's online FreeBSD code, it looks like they have
> > very different (presumably older) FreeBSD code
> > [https://opensource.apple.com/source/Libc/Libc-320.1.3/locale/FreeBSD/tolower.c.auto.html],
> > and the footer of the data file that it reads implies that they're using data
> > from Unicode 3.2 (released in 2002, which would make sense given the
> > 2002 BSD copyright date in the tolower.c source):
>
> Sigh, can't they just ship machine consumable bitmaps or something?

because everyone wants different formats. even the same library has
changed over time. (and not just because characters went from 16 bits
to 21 bits!)

> I can have
> my test plumbing pull "standards" files, ala:
>
> https://github.com/landley/toybox/blob/master/mkroot/packages/tests
>
> But an organization shipping a PDF or 9 interlocking JSON files with a turing
> complete stylesheet doesn't help much.

(not really the point, but the one you want for the stuff you're
talking about here is actually just a text file. i've repeatedly been
tempted to teach unicode(1) to read it, since it's always installed on
macOS and debian anyway [for values of "always" that include "all my
machines, anyway"], to be able to show far more information about any
given character.)
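
A minimal sketch of reading one line of that text file, assuming the file meant here is UnicodeData.txt (the message doesn't name it) with its usual 15 ';'-separated fields, where fields 12, 13 and 14 hold the simple uppercase, lowercase and titlecase mappings. The struct and function names are made up for illustration. utf8_len() is included because the U+2c62 case from earlier in the thread is exactly one where the lowercase form needs fewer UTF-8 bytes.

```c
#include <stdlib.h>
#include <string.h>

// Simple (one-to-one) case mappings from one UnicodeData.txt line.
// 0 means "no mapping listed". (Hypothetical struct, not from toybox.)
struct simple_case {
  unsigned code, upper, lower, title;
};

// Fields are ';'-separated and often empty, so strtok() (which skips
// empty fields) is the wrong tool. Walk the separators by hand instead.
// Returns 0 on success, -1 if the line doesn't have exactly 15 fields.
int parse_unicode_data_line(const char *line, struct simple_case *out)
{
  const char *field = line;
  int i;

  memset(out, 0, sizeof(*out));
  for (i = 0; ; i++) {
    // strtoul() returns 0 for an empty field, which doubles as "no mapping".
    if (i == 0) out->code = strtoul(field, 0, 16);
    else if (i == 12) out->upper = strtoul(field, 0, 16);
    else if (i == 13) out->lower = strtoul(field, 0, 16);
    else if (i == 14) out->title = strtoul(field, 0, 16);

    field = strchr(field, ';');
    if (!field) break;
    field++;
  }

  // i is the index of the last field seen; a complete line ends at field 14.
  return (i == 14) ? 0 : -1;
}

// Number of bytes needed to encode code point c as UTF-8.
int utf8_len(unsigned c)
{
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}
```

Fed the U+2c62 line, this reports a lowercase mapping of U+026b, and utf8_len() goes from 3 bytes to 2. It only covers the simple mappings in that one file.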

> > so, yeah, i don't think there was anything clever or mysterious going
> > on here --- macOS is just using Unicode data from 22 years ago. (which
> > is an amusing real-world example of why i keep saying "you probably
> > don't want to get into the business of redistributing Unicode data; it
> > changes every year" :-) )
>
> A youtuber named Ryan McBeth is fond of explaining the difference between a
> "problem" and a "dilemma". A problem has an obvious solution, which may be
> painful or expensive but there's not a lot of disagreement on what success looks
> like. A dilemma has multiple ways to address it, each of which has something
> uniquely wrong with it. Problems don't lead to indecision, dilemmas do (and thus
> accumulate).
>
> In this case, the dilemma is "trusting libc to get it wrong differently in each
> new environment" vs "taking a large expense onboard with borderline xkcd
> violation". (If there is an xkcd strip explaining why not to do something, you
> probably shouldn't do it. In this case https://xkcd.com/927/ )
>
> Which is _sad_ because there's only a dozen ispunct() variants that read a bit
> out of a bitmap (and haven't significantly changed since K&R: neither isblank()
> nor isascii() is worth the wrapper), plus a toupper/tolower pair that map
> integers with "no change" being the common case.

(one of the things you'll learn from parsing the file is that that's
not how toupper()/tolower() work for all characters. plus there's
titlecase. plus case folding.)
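
To make that concrete, here is a small sketch of why an "integer in, integer out" table can't hold every case mapping. The three rows are transcribed by hand from Unicode's SpecialCasing.txt (an assumption about which data is meant; the message doesn't name it), and the struct and function names are hypothetical. Titlecase and case folding are left out entirely.

```c
#include <stddef.h>

// One code point can map to a sequence of up to three code points.
// Unused slots are 0. (Hypothetical layout, for illustration only.)
struct special_case {
  unsigned code;
  unsigned lower[3];
  unsigned upper[3];
};

static const struct special_case special[] = {
  {0x00DF, {0x00DF}, {0x0053, 0x0053}},  // sharp s uppercases to "SS"
  {0x0130, {0x0069, 0x0307}, {0x0130}},  // I with dot above: i + combining dot
  {0xFB00, {0xFB00}, {0x0046, 0x0046}},  // ff ligature uppercases to "FF"
};

// Copy a 0-padded mapping into out[] and return how many code points it has.
static int copy_mapping(const unsigned from[3], unsigned out[3])
{
  int n = 0;

  while (n < 3 && from[n]) {
    out[n] = from[n];
    n++;
  }

  return n;
}

// Both return 0 when c has no special mapping, meaning "fall back to the
// simple one-to-one mapping"; otherwise the number of code points in out[].
int full_upper(unsigned c, unsigned out[3])
{
  size_t i;

  for (i = 0; i < sizeof(special)/sizeof(*special); i++)
    if (special[i].code == c) return copy_mapping(special[i].upper, out);

  return 0;
}

int full_lower(unsigned c, unsigned out[3])
{
  size_t i;

  for (i = 0; i < sizeof(special)/sizeof(*special); i++)
    if (special[i].code == c) return copy_mapping(special[i].lower, out);

  return 0;
}
```

The caller has to cope with the output being longer than the input in code points, not just in bytes, which is a step beyond the buffer-sizing problem this thread started with.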

> Plus unicode has wcwidth().

no, it doesn't. (i wouldn't be maintaining my own if it did!)

> Yes, it's over a (sparse!) table with space for a million entries, but CSV
> encoding all that data in human+machine readable ASCII should gzip down to what,
> 500k?
>
> Let's see, the bits seem to be alpha, cntrl, digit, punct, and space, and then
> width (mostly 0, 1, or 2 but we've talked about exceptions), and two translation
> codepoints for toupper and tolower.
>
> You can easily derive isalnum() and isxdigit(), and isascii() and isblank() are
> trivial according to the man page. If the table has upper and lower mappings
> (I.E. what character this turns into, zero if it doesn't) then you don't need
> isupper() or islower() bits unless there's cases where "this isn't upper case
> but can be converted to lower case" (which aren't covered by having BOTH
> toupper() and tolower() mappings for the same character).
>
> I'm honestly unclear on what "isgraph" does, "any printable character except
> space"... if isprint() means "not width 0" then that's just adding && !isspace()
> so doesn't need to be in the table.
>
> So code, alpha, cntrl, digit, punct, space, width, upper, lower. Something like:
>
> 0,0,0,0,0,0,0,0,0
> 13,0,1,0,0,1,0,0,0
> 32,0,0,0,0,1,1,0,0
> 57,0,0,1,0,0,1,0,0
> 58,0,0,0,1,0,1,0,0
> 65,1,0,0,0,0,1,0,97
>
> No, that doesn't cover weird stuff like the right-to-left gearshift or the
> excluded mapping ranges or even the low ascii characters having special effects
> like newline and tab, but those aren't really "characters" are they?

those are exactly the weeds where all the dragons lurk. even the
EastAsianWidth property, which is as close as unicode comes to having
"wcwidth()", has "ambiguous" _and_ "neutral" --- two distinct special
cases :-)
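
A rough sketch of what those two special cases do to a width function. The six property values (N, A, H, W, F, Na) are the ones the East Asian Width data uses; the mapping to column counts is only a common convention, since Unicode itself doesn't define one, and the function name and the ambiguous_width parameter are invented for illustration.

```c
#include <string.h>

// Map an East Asian Width property value to a column count.
// "A" (ambiguous) has no single right answer, so the caller picks
// 1 or 2 as policy. "N" (neutral) is treated as 1 here, which is a
// choice rather than something the property itself says.
// Returns -1 for a string that isn't one of the six property values.
int columns_for_eaw(const char *eaw, int ambiguous_width)
{
  if (!strcmp(eaw, "W") || !strcmp(eaw, "F")) return 2;
  if (!strcmp(eaw, "A")) return ambiguous_width;
  if (!strcmp(eaw, "N") || !strcmp(eaw, "Na") || !strcmp(eaw, "H")) return 1;

  return -1;
}
```

This deliberately ignores everything else a wcwidth() has to decide (combining marks, control characters, unassigned code points), so it's the property lookup only, not a width implementation.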

> Special
> case the special cases, don't try to represent them in a table like that beyond
> what ispunct() and toupper() and friends should return. (Maybe have a -1 width
> for "weird".)
>
> But again, that's my dunning-kruger talking. I don't see WHY it's so
> complicated. Arguing about efficient representation isn't the same as arguing
> about "this is the data, it should be easy to diff new releases against the
> previous release to see what changed, so why don't they publish this?"

i suspect they'd ask "what do you need the diff for? surely you're not
_manually_ translating this into some other form?" :-)

> Heck, if your width options are 0, 1, 2, and 3 (with 3 being "exception, look it
> up in another table"), all the data except case mapping is one byte per character...

fwiw, because it's written in terms of icu4c, which is in turn mostly
just exposing the unicode data, the bionic implementation of wcwidth()
gives a decent "pseudocode" view of how you'd implement it in terms of
the unicode data directly:
https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wcwidth.cpp

(at least "to the best of my knowledge". since there is no standard,
and this function most recently changed _yesterday_, i can give no
guarantee :-) )

> [Have pondered for a while. Dunno what else to say, so pressing send.]
>
> Rob

