[Toybox] strlower() bug

Fri May 31 10:53:08 PDT 2024

On Fri, May 31, 2024 at 12:30 PM Rob Landley <rob at landley.net> wrote:
>
> On 5/30/24 16:12, enh wrote:
> >> > hmm... looking at Apple's online FreeBSD code, it looks like they have
> >> > very different (presumably older) FreeBSD code
> >> > [https://opensource.apple.com/source/Libc/Libc-320.1.3/locale/FreeBSD/tolower.c.auto.html],
> >> > and the footer of the file that it reads implies that they're using data
> >> > from Unicode 3.2 (released in 2002, which would make sense given the
> >> > 2002 BSD copyright date in the tolower.c source):
> >>
> >> Sigh, can't they just ship machine consumable bitmaps or something?
> >
> > because everyone wants different formats. even the same library has
> > changed over time. (and not just because characters went from 16 bits
> > to 21 bits!)
>
> Conversion from a simple format seems straightforward to me.
>
> Part of my frame of reference here is Tim Berners-Lee inventing the 404 error.
> That was Tim's big advance that made HTML work where Ted Nelson's overdesigned
> hyper-cyber-iText didn't. Tim 80/20'd the problem by just handling the easy
> cases (we have the data) and punting the hard cases (updating links when they
> moved) to humans.
>
> Ted published his hyper-hype paper in 1965 and then failed to interest anyone in
> it for a quarter century before Tim made something actually useful (beating
> Gopher by about 6 months). Crediting Ted as the inventor of html is like
> crediting Jules Verne as the inventor of the submarine, or H.G. Wells as the
> (eventual) inventor of the time machine. (Lazerpig had a rant about this in his
> video on stealth planes: the inventor is the person who made it WORK, not who
> came up with the idea of humans flying or a knob on the wall that controls the
> air temperature.)
>
> So to me, the question is "how much can we put in a simple format", and then
> have a list of broken characters you need an exception handler function for. How
> do we 80/20 this?
>
> >> I can have
> >> my test plumbing pull "standards" files, ala:
> >>
> >> https://github.com/landley/toybox/blob/master/mkroot/packages/tests
> >>
> >> But an organization shipping a PDF or 9 interlocking JSON files with a turing
> >> complete stylesheet doesn't help much.
> >
> > (not really the point, but the one you want for the stuff you're
> > talking about here is actually just a text file.
>
> Let's see... Ah:
>
> https://www.unicode.org/L2/L1999/UnicodeData.html
>
> That's a bit long. My suggestion had 9 decimal numbers, this has "IDEOGRAPHIC
> TELEGRAPH SYMBOL FOR JANUARY" as one of fifteen fields, with "<compat> 0031
> 6708" being another single field. How nice. (And still extensive warnings that
> this doesn't cover everything. I think "too much is never enough" was an MTV
> slogan back in the 1980s? Ah, it's from "The Marriage of Figaro" in 1784.)

citation needed? (or if you want me to keep trying to think of where
that or something similar occurs in the libretto, at least tell me
whether it's an aria or recitative :-) )
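
For reference, the fifteen-field layout of UnicodeData.txt mentioned above is semicolon-separated with significant empty fields, so it can be boiled down to something close to the simpler table fairly mechanically. A minimal sketch, assuming only the code point, general category, and simple case mappings are wanted; `struct ucd_row` and `ucd_parse()` are hypothetical names, not from toybox or any libc:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical record: only the fields a ctype-style table would need.
struct ucd_row {
  unsigned code;     // field 0: code point (hex)
  char category[3];  // field 2: general category, e.g. "Lu"
  unsigned upper;    // field 12: simple uppercase mapping, 0 if empty
  unsigned lower;    // field 13: simple lowercase mapping, 0 if empty
};

// Split one UnicodeData.txt line on ';'. Empty fields are significant, so
// strtok() (which collapses them) is no good here.
// Returns 0 on success, -1 if the line doesn't look like a 15-field record.
int ucd_parse(const char *line, struct ucd_row *out)
{
  const char *field[15];
  int n = 0;

  while (n < 15) {
    field[n++] = line;
    line = strchr(line, ';');
    if (!line) break;
    line++;
  }
  if (n != 15) return -1;
  if (strcspn(field[2], ";") != 2) return -1;

  out->code = strtoul(field[0], NULL, 16);
  memcpy(out->category, field[2], 2);
  out->category[2] = 0;
  // strtoul() stops at ';' (or end of line), so an empty field parses as 0.
  out->upper = strtoul(field[12], NULL, 16);
  out->lower = strtoul(field[13], NULL, 16);
  return 0;
}
```

This deliberately ignores the name, decomposition, and titlecase fields, and it knows nothing about the range markers or the other data files, which is where the "list of exceptions" would have to come from.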

> aosp/external/icu/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt
> aosp/external/icu/android_icu4j/src/main/tests/android/icu/dev/data/unicode/UnicodeData.txt
> aosp/external/icu/icu4c/source/data/unidata/UnicodeData.txt
> aosp/external/pcre/maint/Unicode.tables/UnicodeData.txt
> aosp/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt
> aosp/out/soong/workspace/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt
>
> Android seems to have checked in multiple copies of this file.
>
> $ for i in $THAT; do [ -n "$OLD" ] && diff -u $OLD $i; OLD=$i; done | grep +++
> +++ aosp/external/pcre/maint/Unicode.tables/UnicodeData.txt     2023-08-18 15:16:31.239657629 -0500
> +++ aosp/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt 2023-08-18 15:14:44.351661450 -0500
>
> And I need to re-pull my tree for them to match.
>
> > i've repeatedly been
> > tempted to teach unicode(1) to read it, since it's always installed on
> > macOS and debian anyway [for values of "always" that include "all my
> > machines, anyway"], to be able to show far more information about any
> > given character.)
>
> I've thrown a note on the todo heap...
>
> >> Which is _sad_ because there's only a dozen ispunct() variants that read a bit
> >> out of a bitmap (and haven't significantly changed since K&R: neither isblank()
> >> nor isascii() is worth the wrapper), plus a toupper/tolower pair that map
> >> integers with "no change" being the common case.
> >
> > (one of the things you'll learn from parsing the file is that that's
> > not how toupper()/tolower() work for all characters. plus there's
> > titlecase. plus case folding.)
>
> "For all characters". I'm just looking for low hanging fruit and a list of
> exceptions to punt to a function.
>
> >> Plus unicode has wcwidth().
> >
> > no, it doesn't. (i wouldn't be maintaining my own if it did!)
>
> In ascii, wcwidth() is basically isprint() plus "tab is weird".
>
> For unicode, wcwidth() comes into play. The unicode bureaucracy committee being
> too microsofted to competently provide one is irrelevant to wcwidth() not being
> needed for ascii.
>
> (I also note the assumption of monospaced fonts in all this. Java's
> FontMetrics class was about measuring pixel counts in non-monospaced fonts, which
> this doesn't even contemplate.)

this is why i keep telling you that wcwidth() only really makes sense
for tty-based stuff. and even there ... i'm curious whether the
different terminal emulators actually behave the same in any of the
interesting cases. (_especially_ when you get to the "that can't
happen in well-formed text in the language that uses that script"
cases.)

> >> So code, alpha, cntrl, digit, punct, space, width, upper, lower. Something like:
> >>
> >> 0,0,0,0,0,0,0,0,0
> >> 13,0,1,0,0,1,0,0,0
> >> 32,0,0,0,0,1,1,0,0
> >> 57,0,0,1,0,0,1,0,0
> >> 58,0,0,0,1,0,1,0,0
> >> 65,1,0,0,0,0,1,0,97
> >>
> >> No, that doesn't cover weird stuff like the right-to-left gearshift or the
> >> excluded mapping ranges or even the low ascii characters having special effects
> >> like newline and tab, but those aren't really "characters" are they?
> >
> > those are exactly the weeds where all the dragons lurk. even the
> > EastAsianWidth property, which is as close as unicode comes to having
> > "wcwidth()" has "ambiguous" _and_ "neutral" --- two distinct special
> > cases :-)
>
> I'm trying for html, not hypertext. I expect 404 errors something/someone else
> will have to handle. A function returning "dunno" is acceptable in this context.
> Somebody else writing a wrapper function to intercept "dunno" and handle 37
> weird bits is "an exercise left for the reader".
>
> >> Special
> >> case the special cases, don't try to represent them in a table like that beyond
> >> what ispunct() and toupper() and friends should return. (Maybe have a -1 width
> >> for "weird".)
> >>
> >> But again, that's my dunning-kruger talking. I don't see WHY it's so
> >> complicated. Arguing about efficient representation isn't the same as arguing
> >> about "this is the data, it should be easy to diff new releases against the
> >> previous release to see what changed, so why don't they publish this?"
> >
> > i suspect they'd ask "what do you need the diff for? surely you're not
> > _manually_ translating this into some other form?" :-)
>
> A) Nuts to their white mice.
>
> 2) I want to see what changed so I can confirm I can ignore it (or add "dunno").
>
> III) The python approach of enforcing version number without caring what's IN
> the version excludes the possibility of other implementations and extensions. If
> a Korean standards body wanted to take its country range and define its own
> local properties for code points within there, that's irrelevant to unicode
> committee draft document release versioning procedure appendix formatting
> clarification updates (volume III).
>
> The data should not be precious. It's just data. NOT being able to diff it is
> suspicious.
>
> >> Heck, if your width options are 0, 1, 2, and 3 (with 3 being "exception, look it
> >> up in another table"), all the data except case mapping is one byte per character...
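
The one-byte-per-character idea quoted above works out to five class flags plus a two-bit width, which leaves one bit spare. A minimal sketch of that packing; the bit layout and every `CT_`/`ct_` name here are hypothetical, chosen only for illustration:

```c
#include <stdint.h>

// Hypothetical bit layout: five class flags in bits 0-4, width in bits 5-6.
enum {
  CT_ALPHA = 1 << 0,
  CT_CNTRL = 1 << 1,
  CT_DIGIT = 1 << 2,
  CT_PUNCT = 1 << 3,
  CT_SPACE = 1 << 4,
  CT_FLAGS = 0x1f,
};
#define CT_WIDTH_LOOKUP 3  // "exception, look it up in another table"

// Pack class flags and a 0-3 width into one byte.
static inline uint8_t ct_pack(unsigned flags, unsigned width)
{
  return (flags & CT_FLAGS) | ((width & 3) << 5);
}

// Unpack the two halves again.
static inline unsigned ct_flags(uint8_t b) { return b & CT_FLAGS; }
static inline unsigned ct_width(uint8_t b) { return (b >> 5) & 3; }
```

Case mappings don't fit in this byte and would still need their own table, as the quoted text says.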
> >
> > fwiw, because it's written in terms of icu4c,
>
> An external black box library dependency I don't want to import, and which you
> didn't want to include in static binaries. (And the above file list had 3 "icu"
> implementations next to each other.)
>
> > which is in turn mostly just exposing the unicode data,
>
> Can we go from "mostly" to "all"? :)
>
> Not that I particularly want to ship a large ascii table either. When I dug into
> musl's take on this, I was mostly reverse engineering their compression format
> and then going "huh, yeah you probably do want to compress this".
>
> I could generate the table I listed with a C program that runs ispunct() and
> similar on every unicode code point and outputs the result. I could then compare
> what musl, glibc, and bionic produce for their output. The problem is it's not
> authoritative, it's downwind of the "macos is still using 2002 data" issue that
> keeps provoking this. :(
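
A sketch of the "run the classification functions on every code point and print the result" program described above might look like the following. It uses the wide-character functions from `<wctype.h>`, since the narrow ones only take unsigned char values. `struct row`, `classify()` and `print_row()` are hypothetical names, and the width column is only a placeholder (just `iswprint()`), not a real wcwidth():

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

// One row of the proposed table, as the current locale's libc sees it.
struct row {
  unsigned long code;
  int alpha, cntrl, digit, punct, space;
  int width;                  // ASSUMPTION: placeholder, 1 if iswprint() else 0
  unsigned long upper, lower; // 0 means "no change"
};

struct row classify(wint_t c)
{
  wint_t up = towupper(c), lo = towlower(c);
  struct row r = {
    .code = c,
    .alpha = !!iswalpha(c), .cntrl = !!iswcntrl(c), .digit = !!iswdigit(c),
    .punct = !!iswpunct(c), .space = !!iswspace(c), .width = !!iswprint(c),
    .upper = up == c ? 0 : up, .lower = lo == c ? 0 : lo,
  };

  return r;
}

// Emit "code,alpha,cntrl,digit,punct,space,width,upper,lower".
void print_row(FILE *fp, struct row r)
{
  fprintf(fp, "%lu,%d,%d,%d,%d,%d,%d,%lu,%lu\n", r.code, r.alpha, r.cntrl,
    r.digit, r.punct, r.space, r.width, r.upper, r.lower);
}
```

To dump a whole libc, call setlocale(LC_ALL, "") once (with a UTF-8 locale in the environment) and loop from 0 to 0x10ffff calling print_row(stdout, classify(c)). That assumes a 32-bit wchar_t, and as noted above the output is only as authoritative as the libc that produced it.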

i'm really confused that you keep mentioning ascii. if you really mean
ispunct() here, say, and not iswpunct(), then that's a completely
solved problem --- ispunct() only covers ascii, and there's no
implementation we've seen that differs from any of the others there.
posix is explicit about what you're allowed to pass in:

       The c argument is an int, the value of which the application
       shall ensure is a character representable as an unsigned char or
       equal to the value of the macro EOF. If the argument has any
       other value, the behavior is undefined.
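
In practice that requirement means casting through unsigned char before calling any of the `<ctype.h>` functions on bytes from a string. A minimal sketch (`count_punct()` is a hypothetical helper, not from any library):

```c
#include <ctype.h>
#include <stddef.h>

// Count punctuation bytes in a C string. The cast matters: plain char may be
// signed, so bytes >= 0x80 would otherwise reach ispunct() as negative ints,
// which is the undefined case described above.
size_t count_punct(const char *s)
{
  size_t n = 0;

  for (; *s; s++) if (ispunct((unsigned char)*s)) n++;
  return n;
}
```

In the default "C" locale the bytes of a UTF-8 multibyte sequence are simply never punctuation, which is why anything beyond ascii has to go through iswpunct() instead.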

> > the bionic implementation of wcwidth()
> > gives a decent "pseudocode" view of how you'd implement it in terms of
> > the unicode data directly:
> > https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wcwidth.cpp
> >
> > (at least "to the best of my knowledge". since there is no standard,
> > and this function most recently changed _yesterday_, i can give no
> > guarantee :-) )
>
> That looks like the exception handler wrapper function I was referring to
> earlier. :)
>
> None of this seems likely to handle my earlier "widest unicode characters"
> thread with the REAL oddball encodings, but none of the current ones do either
> and that's ok. Just acknowledging that there needs to BE a special case
> exception list is the first step to having a GOOD special case exception list
> that can include that sort of thing. (And have all the arguments about excluding
> stuff to keep it down to a dull roar...)
>
> I.E. if the table of standard data can't cover everything it shouldn't try to,
> so what's the sane subset we CAN cleanly automate?

well, the most likely exception you'll encounter isn't about the
_characters_, it's about the _locale_ you're asking the question for.
one problem with unification (not just "han") is that you have
multiple "characters" (in terms of "what do they mean"/"how do they
behave") mapped to the same codepoint. (specifically here i'm thinking
of turkish/azeri i.)
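
To make the turkish/azeri case concrete: the same code point U+0049 lowercases to U+0069 in most locales but to dotless U+0131 in tr and az, so a per-code-point table can't answer the question without also being told the locale. A minimal sketch, where the `turkic` flag is a hypothetical stand-in for "the caller's locale is tr or az" and only the two i cases are handled:

```c
#include <wctype.h>

// Lowercase one code point. Only the dotted/dotless i pair is special-cased;
// everything else falls through to the libc's towlower().
wint_t lower_for_locale(wint_t c, int turkic)
{
  if (turkic) {
    if (c == 0x49) return 0x131;  // I -> dotless i (U+0131)
    if (c == 0x130) return 0x69;  // I with dot above (U+0130) -> i
  }
  return towlower(c);
}
```

Unicode keeps these locale-conditional rules in SpecialCasing.txt rather than UnicodeData.txt, so they would live on the "exception list" side of any simple table.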

> Rob