[Toybox] strlower() bug

Rob Landley rob at landley.net
Fri May 31 09:41:04 PDT 2024


On 5/30/24 16:12, enh wrote:
>> > hmm... looking at Apple's online FreeBSD code, it looks like they have
>> > very different (presumably older) FreeBSD code
>> > [https://opensource.apple.com/source/Libc/Libc-320.1.3/locale/FreeBSD/tolower.c.auto.html],
>> > and the footer of the file implies that they're using data
>> > from Unicode 3.2 (released in 2002, which would make sense given the
>> > 2002 BSD copyright date in the tolower.c source):
>>
>> Sigh, can't they just ship machine-consumable bitmaps or something?
> 
> because everyone wants different formats. even the same library has
> changed over time. (and not just because characters went from 16 bits
> to 21 bits!)

Conversion from a simple format seems straightforward to me.

Part of my frame of reference here is Tim Berners-Lee inventing the 404 error.
That was Tim's big advance that made HTML work where Ted Nelson's overdesigned
hyper-cyber-iText didn't. Tim 80/20'd the problem by just handling the easy
cases (we have the data) and punting the hard cases (updating links when they
moved) to humans.

Ted published his hyper-hype paper in 1965 and then failed to interest anyone in
it for a quarter century before Tim made something actually useful (beating
Gopher by about 6 months). Crediting Ted as the inventor of HTML is like
crediting Jules Verne as the inventor of the submarine, or H.G. Wells as the
(eventual) inventor of the time machine. (Lazerpig had a rant about this in his
video on stealth planes: the inventor is the person who made it WORK, not who
came up with the idea of humans flying or a knob on the wall that controls the
air temperature.)

So to me, the question is "how much can we put in a simple format", with the
leftovers going on a list of broken characters you need an exception handler
function for. How do we 80/20 this?

>> I can have
>> my test plumbing pull "standards" files, ala:
>>
>> https://github.com/landley/toybox/blob/master/mkroot/packages/tests
>>
>> But an organization shipping a PDF or 9 interlocking JSON files with a turing
>> complete stylesheet doesn't help much.
> 
> (not really the point, but the one you want for the stuff you're
> talking about here is actually just a text file.

Let's see... Ah:

https://www.unicode.org/L2/L1999/UnicodeData.html

That's a bit long. My suggestion had 9 decimal numbers; this has "IDEOGRAPHIC
TELEGRAPH SYMBOL FOR JANUARY" as one of fifteen fields, with "<compat> 0031
6708" being another single field. How nice. (And there are still extensive
warnings that this doesn't cover everything. I think "too much is never enough"
was an MTV slogan back in the 1980s? Ah, it's from "The Marriage of Figaro" in
1784.)

aosp/external/icu/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt
aosp/external/icu/android_icu4j/src/main/tests/android/icu/dev/data/unicode/UnicodeData.txt
aosp/external/icu/icu4c/source/data/unidata/UnicodeData.txt
aosp/external/pcre/maint/Unicode.tables/UnicodeData.txt
aosp/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt
aosp/out/soong/workspace/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt

Android seems to have checked in multiple copies of this file.

$ for i in $THAT; do [ -n "$OLD" ] && diff -u $OLD $i; OLD=$i; done | grep +++
+++ aosp/external/pcre/maint/Unicode.tables/UnicodeData.txt	2023-08-18
15:16:31.239657629 -0500
+++ aosp/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt
2023-08-18 15:14:44.351661450 -0500

And I need to re-pull my tree for them to match.

> i've repeatedly been
> tempted to teach unicode(1) to read it, since it's always installed on
> macOS and debian anyway [for values of "always" that include "all my
> machines, anyway"], to be able to show far more information about any
> given character.)

I've thrown a note on the todo heap...
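
For the record, the file looks pretty easy to parse: 15 semicolon separated
fields per line, code point in field 0, simple lowercase mapping in field 13
(counting from zero). A quick sketch, assuming a UnicodeData.txt in the current
directory and eliding most error handling:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  FILE *fp = fopen("UnicodeData.txt", "r");
  char line[1024];

  if (!fp) return 1;
  while (fgets(line, sizeof(line), fp)) {
    char *field[15], *s = line;
    int i;

    // Split the semicolon separated fields in place.
    for (i = 0; i < 15 && s; i++) {
      field[i] = s;
      if ((s = strchr(s, ';'))) *s++ = 0;
    }
    // An empty mapping field means "no change", which is most of them.
    if (i == 15 && *field[13])
      printf("%lx -> %s\n", strtol(field[0], 0, 16), field[13]);
  }
  fclose(fp);

  return 0;
}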

>> Which is _sad_ because there's only a dozen ispunct() variants that read a bit
>> out of a bitmap (and haven't significantly changed since K&R: neither isblank()
>> nor isascii() is worth the wrapper), plus a toupper/tolower pair that map
>> integers with "no change" being the common case.
> 
> (one of the things you'll learn from parsing the file is that that's
> not how toupper()/tolower() works for all characters. plus there's
> titlecase. plus case folding.)

"For all characters". I'm just looking for low hanging fruit and a list of
exceptions to punt to a function.

>> Plus unicode has wcwidth().
> 
> no, it doesn't. (i wouldn't be maintaining my own if it did!)

In ascii, wcwidth() is basically isprint() plus "tab is weird".
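
Something like this, hypothetically (made-up name, and tab lands in the "weird"
bucket since its width depends on where the cursor already is):

#include <ctype.h>

int ascii_wcwidth(int c)
{
  if (!c) return 0;           // NUL occupies no columns
  if (!isprint(c)) return -1; // control characters (tab included): "weird"
  return 1;                   // every printable ascii character is one column
}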

For unicode, wcwidth() comes into play. The unicode bureaucracy committee being
too microsofted to competently provide one doesn't change that: ascii never
really needed a wcwidth(), unicode does.

(I also note the assumption of monospaced fonts in all this. Java's
FontMetrics was about measuring pixel counts in non-monospaced fonts, which
this doesn't even contemplate.)

>> So code, alpha, cntrl, digit, punct, space, width, upper, lower. Something like:
>>
>> 0,0,0,0,0,0,0,0,0
>> 13,0,1,0,0,1,0,0,0
>> 32,0,0,0,0,1,1,0,0
>> 57,0,0,1,0,0,1,0,0
>> 58,0,0,0,1,0,1,0,0
>> 65,1,0,0,0,0,1,0,97
>>
>> No, that doesn't cover weird stuff like the right-to-left gearshift or the
>> excluded mapping ranges or even the low ascii characters having special effects
>> like newline and tab, but those aren't really "characters" are they?
> 
> those are exactly the weeds where all the dragons lurk. even the
> EastAsianWidth property, which is as close as unicode comes to having
> "wcwidth()" has "ambiguous" _and_ "neutral" --- two distinct special
> cases :-)

I'm trying for HTML, not hypertext. I expect 404 errors that something/someone
else will have to handle. A function returning "dunno" is acceptable in this
context.
Somebody else writing a wrapper function to intercept "dunno" and handle 37
weird bits is "an exercise left for the reader".
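
In C the split might look something like this (every name here is invented,
with -1 as the "dunno"):

int table_width(unsigned c);     // generated: 0, 1, 2, or -1 for "dunno"
int exception_width(unsigned c); // hand-written list of the 37 weird bits

int my_wcwidth(unsigned c)
{
  int w = table_width(c);

  return (w == -1) ? exception_width(c) : w;
}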

>> Special
>> case the special cases, don't try to represent them in a table like that beyond
>> what ispunct() and toupper() and friends should return. (Maybe have a -1 width
>> for "weird".)
>>
>> But again, that's my dunning-kruger talking. I don't see WHY it's so
>> complicated. Arguing about efficient representation isn't the same as arguing
>> about "this is the data, it should be easy to diff new releases against the
>> previous release to see what changed, so why don't they publish this?"
> 
> i suspect they'd ask "what do you need the diff for? surely you're not
> _manually_ translating this into some other form?" :-)

A) Nuts to their white mice.

2) I want to see what changed so I can confirm I can ignore it (or add "dunno").

III) The python approach of enforcing a version number without caring what's IN
the version excludes the possibility of other implementations and extensions. If
a Korean standards body wanted to take its country range and define its own
local properties for code points within that range, that's irrelevant to unicode
committee draft document release versioning procedure appendix formatting
clarification updates (volume III).

The data should not be precious. It's just data. NOT being able to diff it is
suspicious.

>> Heck, if your width options are 0, 1, 2, and 3 (with 3 being "exception, look it
>> up in another table"), all the data except case mapping is one byte per character...
> 
> fwiw, because it's written in terms of icu4c,

An external black-box library dependency I don't want to import, and which you
didn't want to include in static binaries. (And the above file list had 3 "icu"
implementations next to each other.)

> which is in turn mostly just exposing the unicode data,

Can we go from "mostly" to "all"? :)

Not that I particularly want to ship a large ascii table either. When I dug into
musl's take on this, I was mostly reverse engineering their compression format
and then going "huh, yeah you probably do want to compress this".
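
The quoted "one byte per character" idea could be as simple as this (the bit
layout is entirely my own invention, not anybody's actual format):

// 2 bits of width (3 = "exception, look it up in another table"),
// one bit per character class, one bit to spare.
enum {
  UC_WIDTH = 3 << 0,
  UC_ALPHA = 1 << 2,
  UC_CNTRL = 1 << 3,
  UC_DIGIT = 1 << 4,
  UC_PUNCT = 1 << 5,
  UC_SPACE = 1 << 6,
};

extern const unsigned char uc_props[0x110000]; // generated (compress later)

int uc_width(unsigned c)
{
  int w = uc_props[c] & UC_WIDTH;

  return (w == 3) ? -1 : w; // -1 = punt to the exception handler
}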

I could generate the table I listed with a C program that runs ispunct() and
similar on every unicode code point and outputs the result, then compare what
musl, glibc, and bionic each produce. The problem is it's not authoritative:
it's downwind of the "macos is still using 2002 data" issue that keeps
provoking this. :(
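
That generator is only a few lines regardless, assuming a libc with a "C.UTF-8"
locale installed (0 in the upper/lower columns means "no change", like the
table up top):

#define _XOPEN_SOURCE 700 // for wcwidth()
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
  unsigned c;

  if (!setlocale(LC_ALL, "C.UTF-8")) return 1;
  for (c = 0; c < 0x110000; c++) {
    unsigned up = towupper(c), low = towlower(c);

    // code, alpha, cntrl, digit, punct, space, width, upper, lower
    // (wcwidth() returns -1 where this libc says "not printable")
    printf("%u,%d,%d,%d,%d,%d,%d,%u,%u\n", c, !!iswalpha(c), !!iswcntrl(c),
      !!iswdigit(c), !!iswpunct(c), !!iswspace(c), wcwidth(c),
      up == c ? 0 : up, low == c ? 0 : low);
  }

  return 0;
}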

> the bionic implementation of wcwidth()
> gives a decent "pseudocode" view of how you'd implement it in terms of
> the unicode data directly:
> https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wcwidth.cpp
> 
> (at least "to the best of my knowledge". since there is no standard,
> and this function most recently changed _yesterday_, i can give no
> guarantee :-) )

That looks like the exception handler wrapper function I was referring to
earlier. :)

None of this seems likely to handle my earlier "widest unicode characters"
thread with the REAL oddball encodings, but none of the current ones do either,
and that's ok. Just acknowledging that there needs to BE a special case
exception list is the first step to having a GOOD special case exception list
that can include that sort of thing. (And have all the arguments about excluding
stuff to keep it down to a dull roar...)

I.e. if the table of standard data can't cover everything, it shouldn't try to,
so what's the sane subset we CAN cleanly automate?

Rob

