[Toybox] strlower() bug

Wed May 29 03:01:38 PDT 2024

On 5/22/24 09:30, enh wrote:
> On Tue, May 14, 2024 at 2:58 PM Rob Landley <rob at landley.net> wrote:
>> It looks like macos towlower() refuses to return expanding unicode characters.
>> Possibly to avoid exactly the kind of bug this fixed, in exchange for corrupting
>> the data.
> 
> yeah, i don't know whether it's on purpose or a bug, but that does
> seem to be the case... i tested with another Latin Extended-B
> character whose uppercase and lowercase forms are both in the same
> block (and thus have the same utf8 encoding length), and macOS
> towlower() does work for that.
> 
> hmm, actually maybe it's just that their Unicode data is out of date?
> it looks like they don't know about Latin Extended-C at all? a code
> point like U+2c62 that gets _smaller_ (because it's in the IPA
> Extensions block) doesn't work either.
> 
> i did try looking in FreeBSD, but i've never understood how this stuff
> works there.

FreeBSD questions go to Ed Maste <emaste at freebsd.org> who is theoretically
subscribed here but keeps getting unsubscribed by gmail bounces.

> i'm guessing from the fact i've never found them that the
> implementations are all generated at build time, subtly enough that my
> attempts to grep for the generators fail.
> 
> hmm... looking at Apple's online FreeBSD code, it looks like they have
> very different (presumably older) FreeBSD code
> [https://opensource.apple.com/source/Libc/Libc-320.1.3/locale/FreeBSD/tolower.c.auto.html],
> and the footer of the file that reads implies that they're using data
> from Unicode 3.2 (released in 2002, which would make sense given the
> 2002 BSD copyright date in the tolower.c source):

Sigh, can't they just ship machine-consumable bitmaps or something? I can have
my test plumbing pull "standards" files, a la:

https://github.com/landley/toybox/blob/master/mkroot/packages/tests

But an organization shipping a PDF or 9 interlocking JSON files with a
Turing-complete stylesheet doesn't help much.

> so, yeah, i don't think there was anything clever or mysterious going
> on here --- macOS is just using Unicode data from 22 years ago. (which
> is an amusing real-world example of why i keep saying "you probably
> don't want to get into the business of redistributing Unicode data; it
> changes every year" :-) )

A YouTuber named Ryan McBeth is fond of explaining the difference between a
"problem" and a "dilemma". A problem has an obvious solution, which may be
painful or expensive but there's not a lot of disagreement on what success looks
like. A dilemma has multiple ways to address it, each of which has something
uniquely wrong with it. Problems don't lead to indecision, dilemmas do (and thus
accumulate).

In this case, the dilemma is "trusting libc to get it wrong differently in each
new environment" vs "taking a large expense onboard with borderline xkcd
violation". (If there is an xkcd strip explaining why not to do something, you
probably shouldn't do it. In this case https://xkcd.com/927/ )

Which is _sad_ because there are only a dozen ispunct() variants that read a bit
out of a bitmap (and haven't significantly changed since K&R: neither isblank()
nor isascii() is worth the wrapper), plus a toupper/tolower pair that map
integers with "no change" being the common case. Plus Unicode has wcwidth().
Yes, it's over a (sparse!) table with space for a million entries, but CSV
encoding all that data in human+machine readable ASCII should gzip down to what,
500k?

Let's see, the bits seem to be alpha, cntrl, digit, punct, and space, and then
width (mostly 0, 1, or 2 but we've talked about exceptions), and two translation
codepoints for toupper and tolower.

You can easily derive isalnum() and isxdigit(), and isascii() and isblank() are
trivial according to the man page. If the table has upper and lower mappings
(i.e. what character this turns into, zero if it doesn't) then you don't need
isupper() or islower() bits unless there are cases where "this isn't upper case
but can be converted to lower case" (which aren't covered by having BOTH
toupper() and tolower() mappings for the same character).

I'm honestly unclear on what isgraph() does beyond "any printable character
except space"... if isprint() means "not width 0" then that's just adding
&& !isspace(), so it doesn't need to be in the table.
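
A minimal sketch of those derivations, to check that nothing beyond the five
bits, the width, and the two mappings is needed. The struct and the cp_*
names are hypothetical (not existing toybox or libc code), and the
isupper/islower/isprint rules are just my reading of the guesses above:

```c
#include <stdint.h>

/* Hypothetical record for one code point. The field names mirror the
 * columns proposed in the text; this is not an existing toybox or libc
 * type. */
struct cp_info {
  unsigned alpha:1, cntrl:1, digit:1, punct:1, space:1;
  int width;             /* display columns */
  uint32_t upper, lower; /* case mappings, 0 meaning "no change" */
};

int cp_isalnum(const struct cp_info *c) { return c->alpha || c->digit; }

/* Hex digits are ASCII-only, so the code point alone decides it. */
int cp_isxdigit(uint32_t code)
{
  return (code >= '0' && code <= '9') || (code >= 'a' && code <= 'f')
    || (code >= 'A' && code <= 'F');
}

/* Assumption: "upper case" means a lower mapping and no upper mapping,
 * and the reverse for "lower case". */
int cp_isupper(const struct cp_info *c) { return c->lower && !c->upper; }
int cp_islower(const struct cp_info *c) { return c->upper && !c->lower; }

/* Assumption: reads "not width 0" as "takes at least one column". */
int cp_isprint(const struct cp_info *c) { return c->width > 0; }
int cp_isgraph(const struct cp_info *c) { return cp_isprint(c) && !c->space; }
```

A character with both mappings set comes out as neither upper nor lower
under these rules, which may or may not be what a real libc does.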

So code, alpha, cntrl, digit, punct, space, width, upper, lower. Something like:

0,0,1,0,0,0,0,0,0
13,0,1,0,0,1,0,0,0
32,0,0,0,0,1,1,0,0
57,0,0,1,0,0,1,0,0
58,0,0,0,1,0,1,0,0
65,1,0,0,0,0,1,0,97

No, that doesn't cover weird stuff like the right-to-left gearshift or the
excluded mapping ranges or even the low ascii characters having special effects
like newline and tab, but those aren't really "characters" are they? Special
case the special cases, don't try to represent them in a table like that beyond
what ispunct() and toupper() and friends should return. (Maybe have a -1 width
for "weird".)

But again, that's my Dunning-Kruger talking. I don't see WHY it's so
complicated. Arguing about efficient representation isn't the same as arguing
about "this is the data, it should be easy to diff new releases against the
previous release to see what changed, so why don't they publish this?"

Heck, if your width options are 0, 1, 2, and 3 (with 3 being "exception, look it
up in another table"), all the data except case mapping is one byte per character...
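
Sketching the packing to check the arithmetic: five flag bits plus a two-bit
width is seven bits, so it fits with one bit spare. The bit assignments and
the CP_* and cp_* names here are arbitrary choices for illustration, not an
existing format:

```c
#include <stdint.h>

/* Hypothetical layout: flags in bits 0-4, width in bits 5-6, bit 7 spare. */
enum { CP_ALPHA = 1, CP_CNTRL = 2, CP_DIGIT = 4, CP_PUNCT = 8, CP_SPACE = 16 };
#define CP_WIDTH_SHIFT 5
#define CP_WIDTH_EXCEPTION 3 /* "look it up in another table" */

/* Pack the five flag bits and a 0-3 width code into one byte. */
uint8_t cp_pack(unsigned flags, unsigned width)
{
  return (uint8_t)((flags & 31) | ((width & 3) << CP_WIDTH_SHIFT));
}

unsigned cp_flags(uint8_t b) { return b & 31; }
unsigned cp_width(uint8_t b) { return (b >> CP_WIDTH_SHIFT) & 3; }
```

With that layout 'A' (alpha, width 1) packs to 0x21 and carriage return
(cntrl plus space, width 0) packs to 0x12.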

[Have pondered for a while. Dunno what else to say, so pressing send.]

Rob