[Toybox] strlower() bug

Rob Landley rob at landley.net
Sat Jun 1 03:22:56 PDT 2024


On 5/31/24 12:53, enh wrote:
>> Let's see... Ah:
>>
>> https://www.unicode.org/L2/L1999/UnicodeData.html
>>
>> That's a bit long. My suggestion had 9 decimal numbers, this has "IDEOGRAPHIC
>> TELEGRAPH SYMBOL FOR JANUARY" as one of fifteen fields, with "<compat> 0031
>> 6708" being another single field. How nice. (And still extensive warnings that
>> this doesn't cover everything. I think "too much is never enough" was an MTV
>> slogan back in the 1980s? Ah, it's from "The Marriage of Figaro" in 1784.)
> 
> citation needed? (or if you want me to keep trying to think of where
> that or something similar occurs in the libretto, at least tell me
> whether it's an aria or recitative :-) )

Sorry, not the Mozart one. And not the Italian one Mozart based his version on,
but the original French version the Italian one was based on:

https://en.wikipedia.org/wiki/The_Marriage_of_Figaro_(play)

The quote gets translated a few ways out of the 240-year-old French:

https://www.oxfordreference.com/display/10.1093/acref/9780191826719.001.0001/q-oro-ed4-00000807

And to clarify again, I mean Wolfgang, not his equally (if not more) talented
sister Maria, who toured with him as a child prodigy but was sidelined as soon
as she reached "marriageable age" and had to teach piano for a living:

https://en.wikipedia.org/wiki/Maria_Anna_Mozart

Some letters from Wolfgang praising her compositions have survived, but her
parents destroyed all her actual sheet music because it had cooties. Next time
people talk about the "great men of history"... Don't get me started about
Einstein's first wife.

>> In ascii, wcwidth() is basically isprint() plus "tab is weird".
>>
>> For unicode, wcwidth() comes into play. The unicode bureaucracy committee being
>> too microsofted to competently provide one is irrelevant to wcwidth() not being
>> needed for ascii.
>>
>> (I also note the assumption of monospaced fonts in all this. Java's
>> fontmetrics() was about measuring pixel counts in non-monospaced fonts, which
>> this doesn't even contemplate.)
> 
> this is why i keep telling you that wcwidth() only really makes sense
> for tty-based stuff. and even there ...

I need to figure out where to wrap lines in command line editing and text
editors and so on. (I have been relieved of duty on vi, but I still need to make
shell command line editing work. Plus fold and so on. And screen, and watch.
Might do a nano-alike at some point. This is already sort of in top...)

> i'm curious whether the
> different terminal emulators actually behave the same in any of the
> interesting cases. (_especially_ when you get to the "that can't
> happen in well-formed text in the language that uses that script"
> cases.)

I have an ANSI probe sequence to ask where the cursor is, but even if I wanted
to be that chatty (and didn't mind that the response takes an arbitrary and
variable amount of time to arrive, isn't actually guaranteed to come at all,
and shows up surrounded by other input), the answer is stale if the output's
already wrapped and scrolled the screen since the last time I asked. And if I
_disable_ screen wrap then A) I dunno if it's truncated the output, B) lots of
other stuff breaks (it's like leaving the screen in raw mode, only SUBTLY wrong,
and yes QEMU does this from time to time and drives bash line editing NUTS,
which is why run-qemu.sh echoes the relevant "stop doing that" sequence AND
mkroot's init also outputs it)...
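The probe itself is two parts: emit ESC [ 6 n, then fish the reply back out of
the input stream. A sketch of the shape of it (the helper names here are
hypothetical, not actual toybox code):

```c
#include <stdio.h>
#include <unistd.h>

// Parse a DSR (Device Status Report) reply of the form ESC [ row ; col R.
// Returns 1 if buf starts with a complete reply.
int parse_dsr(const char *buf, int *row, int *col)
{
    return sscanf(buf, "\033[%d;%dR", row, col) == 2;
}

// Send the probe. The reply comes back as *input*, at some arbitrary later
// time (if at all), possibly interleaved with real keystrokes -- which is
// exactly the "that chatty" problem described above.
void probe_cursor(void)
{
    write(1, "\033[6n", 4);  // ESC [ 6 n = Device Status Report (cursor)
}
```

So even the happy path means hunting for the reply pattern in the middle of
whatever else the user is typing.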

Which means I need a wcwidth() to know how many columns the next character will
advance the cursor in the terminal before outputting it.
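A minimal sketch of that bookkeeping (the emit() helper is hypothetical, and it
trusts whatever wcwidth() data the libc happens to have, which is the whole
problem):

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <wchar.h>

// Track the cursor column while emitting codepoints, wrapping ourselves
// instead of trusting the terminal to do it. Returns the new column.
int emit(wchar_t wc, int col, int width)
{
    int w = wcwidth(wc);

    if (w < 0) w = 1;        // unmeasurable: caller should escape it instead
    if (col+w > width) {     // won't fit on this line: wrap before emitting
        printf("\r\n");
        col = 0;
    }
    printf("%lc", wc);

    return col+w;
}
```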

>> Not that I particularly want to ship a large ascii table either. When I dug into
>> musl's take on this, I was mostly reverse engineering their compression format
>> and then going "huh, yeah you probably do want to compress this".
>>
>> I could generate the table I listed with a C program that runs ispunct() and
>> similar on every unicode code point and outputs the result. I could then compare
>> what musl, glibc, and bionic produce for their output. The problem is it's not
>> authoritative, it's downwind of the "macos is still using 2002 data" issue that
>> keeps provoking this. :(
> 
> i'm really confused that you keep mentioning ascii. if you really mean
> ispunct() here, say, and not iswpunct(),

The difference between them is that ispunct() has always taken an int but the C
committee was cowardly and refused to make it actually respond to the whole
range, so they created a new function to do the same thing.
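To illustrate the range split (the wrapper functions here are made up for
demonstration; only ispunct()/iswpunct() are real):

```c
#include <ctype.h>
#include <stdio.h>
#include <wctype.h>

// ispunct() takes an int but is only defined for values representable as
// unsigned char, plus EOF: anything else is undefined behavior, not "no".
// Returns -1 for out-of-range input instead of invoking UB.
int punct_byte(int c)
{
    return (c == EOF || (c >= 0 && c <= 255)) ? !!ispunct(c) : -1;
}

// iswpunct() is "the same function again, but actually covering the range":
// wint_t spans every codepoint a wchar_t can hold.
int punct_wide(wint_t c)
{
    return !!iswpunct(c);
}
```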

At least fseeko() can blame the splash damage of LP64 making long and pointer
the same size. (Moore's Law didn't advance the components in a coordinated
manner: we hit the need for >2 gig files ten years before we hit the need for
>4gig system RAM and thus 64 bit registers...)
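Same pattern there too: rather than widen the old function, a new one with a
wider type. fseeko() takes an off_t, which _FILE_OFFSET_BITS=64 makes 64 bits
even on 32-bit targets where long can't hold a >2 gig offset. A sketch
(seek_to() is a hypothetical wrapper):

```c
#define _FILE_OFFSET_BITS 64    // on 32-bit targets, makes off_t 64 bits
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>

// Seek to an absolute offset and report where we landed. On ILP32 without
// the define above, offsets past 2GB wouldn't fit in fseek()'s long.
off_t seek_to(FILE *f, off_t where)
{
    if (fseeko(f, where, SEEK_SET)) return -1;

    return ftello(f);
}
```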

(I suppose the C committee was fighting IBM and Microsoft for 10 years before
utf8 happened, and then the unicode committee had Microsoft on it and thus
combining characters were placed AFTER the characters they combined with so you
never know when you're done rendering a character until you've started reading
the one AFTER it, which is just insane...)
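Concretely, that lookahead requirement is why even counting display cells has
to treat a trailing mark as part of the character before it. The range check
below is a stand-in for a real combining-class table (assuming just
U+0300..U+036F, the basic combining diacriticals), not how a real
implementation would do it:

```c
#include <wchar.h>

// Does this codepoint attach to the character BEFORE it? (Toy subset.)
int is_combining(wchar_t wc)
{
    return wc >= 0x300 && wc <= 0x36f;
}

// Count display cells: a combining mark adds zero columns, so you only know
// the previous character's rendering is finished after reading the NEXT
// codepoint and seeing it isn't combining.
int cells(const wchar_t *s)
{
    int n = 0;

    while (*s) if (!is_combining(*s++)) n++;

    return n;
}
```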

> then that's a completely
> solved problem --- ispunct() only covers ascii, and there's no
> implementation we've seen that differs from any of the others there.

Because the problem in that part of the data set is well defined and everybody
agrees on what success looks like.


>        The c argument is an int, the value of which the application
>        shall ensure is a character representable as an unsigned char or
>        equal to the value of the macro EOF. If the argument has any
>        other value, the behavior is undefined.

Is this integer punctuation? Yes/no.

>> None of this seems likely to handle my earlier "widest unicode characters"
>> thread with the REAL oddball encodings, but none of the current ones do either
>> and that's ok. Just acknowledging that there needs to BE a special case
>> exception list is the first step to having a GOOD special case exception list
>> that can include that sort of thing. (And have all the arguments about excluding
>> stuff to keep it down to a dull roar...)
>>
>> I.E. if the table of standard data can't cover everything it shouldn't try to,
>> so what's the sane subset we CAN cleanly automate?
> 
> well, the most likely exception you'll encounter isn't about the
> _characters_ it's about the _locale_ you're asking the question for.
> one problem with unification (not just "han") is that you have
> multiple "characters" (in terms of "what do they mean"/"how do they
> behave") mapped to the same codepoint.

_I_ don't, no. I'm using the "C" locale with UTF-8 support.

> (specifically here i'm thinking
> of turkish/azeri i.)

Needing to know the locale to render UNICODE CODE POINTS defeats the purpose of
unicode: what values should I get in the "C" locale with UTF-8 support?
(Congratulations to microsoft for reintroducing the concept of CODE PAGES to
UNICODE, but I'm not humoring them. Too broken for words.)

Maybe the table annotates these as "weird" and our stub exception handler
returns 0 for all their attributes. I'm ok with that. I'm not trying to get
everything right, I'm trying to 80/20 this.
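The shape of that 80/20 table might be something like this (the struct, the
ranges, and the values are all illustrative, not actual toybox data):

```c
// Range table for the well-defined cases, plus a "weird" flag whose stub
// handler just reports zero for every attribute of the flagged codepoints.
struct crange { unsigned first, last, width, weird; };

static struct crange table[] = {
    { 0x20, 0x7e, 1, 0 },      // printable ASCII
    { 0x1100, 0x115f, 2, 0 },  // hangul jamo leads: double width
    { 0x0131, 0x0131, 1, 1 },  // dotless i: locale-dependent, so flagged
};

// Width lookup: table answer, 0 for "weird", -1 for "not covered, escape it".
int attr_width(unsigned cp)
{
    for (unsigned i = 0; i < sizeof(table)/sizeof(*table); i++)
        if (cp >= table[i].first && cp <= table[i].last)
            return table[i].weird ? 0 : table[i].width;

    return -1;
}
```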

If somebody who isn't me wants to write a big exception handler that cares about
locale for broken characters the standards committee seemingly accepted bribes
to include, fine. If characters exist that the table cannot, by itself, provide
answers for, then emitting them unescaped into the shell's command line editing,
or into "watch" output, or using them in fields that "ps" or "top" are trying to
align, means stuff may leak out of its box and scroll the screen
inappropriately, and I am FINE with that.

But I'm also fine escaping them: lib/utf8.c already has crunch_escape() doing
the "standard escapes" that vi was doing when I first fed it a bunch of weird
values to see how it would cope years ago. I may not get to use them in vi
itself because I'm not writing that, but I can still have line editing and
friends print a variety of escapes for codepoints I can't reliably measure. It's
not pretty, but it means I retain control of where the cursor is, and the data
can even be represented unambiguously with a bit of work.
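A sketch of that fallback (the escape format and function here are
illustrative; toybox's actual crunch_escape() has its own conventions): when a
codepoint's width can't be measured, emit an escape whose width we DO know, so
cursor tracking never drifts.

```c
#include <stdio.h>

// Render an unmeasurable codepoint as a fixed-alphabet escape. The return
// value is the string length, which (being plain ASCII) is also exactly how
// many columns it occupies -- that's the whole point.
int emit_escaped(unsigned cp, char *buf, int len)
{
    return snprintf(buf, len, "\\u{%x}", cp);
}
```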

Being unable to tell ascii from kanji when statically linked is a bigger issue
from where I'm standing.

(P.S. I still need an ANSI escape sequence parser to do all this right, but I
wrote my first one of those in DOS as a teenager. Probably won't do the full
"man 4 console_codes" collection but I can handle a lot and then ^[ the ESC for
sequences I don't recognize in "watch" and "less" and so on...)
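The skeleton of such a parser is a small state machine. This one only
recognizes CSI sequences (ESC [ parameters final-byte); a real one covering
"man 4 console_codes" needs several more states, but the shape is the same:

```c
// Feed bytes one at a time; *done is set when a complete CSI sequence
// (ESC [ ... final) has just been consumed.
enum state { TEXT, ESC_SEEN, CSI };

enum state ansi_step(enum state s, unsigned char c, int *done)
{
    *done = 0;
    switch (s) {
    case TEXT:     return c == 033 ? ESC_SEEN : TEXT;
    case ESC_SEEN: return c == '[' ? CSI : TEXT;
    case CSI:
        if (c >= 0x40 && c <= 0x7e) {  // final byte ends the sequence
            *done = 1;
            return TEXT;
        }
        return CSI;  // parameter/intermediate bytes stay in CSI
    }

    return TEXT;
}
```

Unrecognized final bytes still terminate cleanly, which is what lets "watch"
and "less" pass through or neuter sequences they don't understand.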

>> Rob

Still Rob


More information about the Toybox mailing list