[Toybox] utf8towc(), stop being defective on null bytes

Rob Landley rob at landley.net
Sun Apr 7 02:03:23 PDT 2024


On 4/6/24 17:48, Oliver Webb via Toybox wrote:
> Heya, looking more at the utf8 code in toybox. The first thing I spotted is that
> utf8towc() and wctoutf8() are both in lib.c instead of utf8.c, why haven't they
> been moved yet, is it easier to track code that way?
The "yet" seems a bit presumptuous and accusatory, but given the title of the
post I suppose that's a given.

I have no current plans to move xstrtol() from lib.c to xwrap(). And atolx() is
only called that instead of xatol() because it does suffixes.

The reason it had to go in lib.c back in the day was explained in the commit
that moved it to lib.c:

  https://github.com/landley/toybox/commit/6e766936396e

As for moving it again someday, unnecessarily moving files is churn that makes
the history harder to see, and lib/*.c has never been a strict division (more
"one giant file seems a bit much"). The basic conversion to/from utf8 is
different from caring about the characteristics of unicode code points (which
the rest of utf8.c does), so having it in lib.c makes a certain amount of sense,
and I'm not strongly motivated to change it without a good reason.

It might happen eventually because I'm still not happy with the general unicode
handling design "yet", but that's a larger story.

Way back when there was "interestingtimes.c" for all my not-curses code, but it
was too long to type and mixed together a couple different kinds of things, so I
split it into utf8.c and tty.c both of which were shorter and didn't screw up
"ls" columnization. (I probably should have called it unicode.c instead, but
unicode is icky, the name is longer, and half the unicode stuff is still in libc
anyway).

Unicode is icky because utf8 and unicode are not the same thing. Ken Thompson
came up with a very elegant utf8 design and microsoft crapped all over it (cap
the conversion range, don't add the base value covered by the previous range so
there are no duplicate encodings) for no apparent reason, and then unicode just
plain got nuts. (You had an ENORMOUS encoding space, the bottom bit could
totally have been combining vs physical characters so we don't need a function
to tell, and combining characters should 100% have gone BEFORE the physical
characters rather than after to avoid the whole problem of FLUSHING them, and
higher bits could indicate 1 column vs 2 column or upper/lower/numeric so you
don't have to test with special functions like that, just collate them into
LARGE BLOCKS which is LESS SILLY than the whole "skipping 0xd800" or whatever
that is for the legacy 16 bit windows encoding space that microsoft CRAPPED INTO
THE STANDARD... Ahem.)

But alas, microsoft bought control of the unicode committee, so you need
functions to say what each character is, and those functions are unnecessarily
complicated. In theory libc has code to do wide char conversions already, but
glibc refuses to enable it unless you've installed and selected a utf8-aware
locale (which is just nuts, but that's glibc for you).

I made some clean dependency-free functions to do the simple stuff that doesn't
care what locale you're in, but there's still wcwidth() and friends that depend
on libc's whims (hence the dance to try to find a utf8 locale in main.c, and the
repeated discussion on this list between me and Elliott and Rich Felker about
trying to come up with portable fontmetrics code. Well, column-metrics. Elliott
keeps trying to dissuade me, but bionic's code for this still didn't work static
linked last I checked...)

Moving stuff around between files when I'm not entirely satisfied with the
design (partly depending on libc state and partly _not_ depending on it) doesn't
seem helpful.

> Also, the documentation
> (header comment) should probably mention that they store stuff as unicode codepoints,

Because I consistently attach comments before the function _body_ explaining
what the function does, instead of putting long explanations in the .h files
included from every other file which the compiler would have to churn through
repeatedly. In this case:

  // Convert utf8 sequence to a unicode wide character
  // returns bytes consumed, or -1 if err, or -2 if need more data.
  int utf8towc(unsigned *wc, char *str, unsigned len)

> I spent a while scratching my head at the fact wide characters are 4 byte int's
> when the maximum utf8 single character length is 6 bytes.

Because Microsoft broke utf8 in multiple ways through the unicode consortium,
among other things making 4 bytes the max:

http://lists.landley.net/pipermail/toybox-landley.net/2017-September/017184.html

In addition to the mailing list threads, I thought I blogged about this rather a
lot at the time:

  https://landley.net/notes-2017.html#29-08-2017
  https://landley.net/notes-2017.html#01-09-2017
  https://landley.net/notes-2017.html#19-10-2017

Which was contemporaneous with the above git commit that added the function to
lib/lib.c. I generally find that stuff by going "when did this code show up
and/or get changed" (in this case September 2017) and then checking known data
sources like "the mailing list" and "the developer's blog". (Whether it's my
code or somebody else's, "this happened 7 years ago" tends to require some
digging to get the details right.)

> Another thing I noticed is that if you pass a null byte into utf8towc(), it will
> assign, but will not "return bytes read" like it's supposed to, instead it will
> return 0 when it reads 1 byte.

The same way strlen() doesn't include the null terminator in the length "like
it's supposed to"? Obviously stpcpy() is defective to write a null terminator
and then return a pointer to that null terminator, instead of returning the
first byte it didn't modify "like it's supposed to"...

An assertion is not the same as a question.

> Suppose you have a function that turns a character string into a array of "wide characters",
> this is easily done by a while loop keeping a index for the old character string and the new
> wide character string. So you should just be able to "while (ai < len) ai += utf8towc(...",
> the problem?

Again with the "should". This is in use in 9 commands outside pending, obviously
it CAN BE used.

Returning length 0 means we hit a null terminator, but returning <0 means there
was a problem preventing conversion. Both require handling, so "+=" without
checking the return value means missing error handling.

The reason calling utf8towc() with length 0 returns -2 (need more data) instead
of 0 is that 0 is ONLY returned when it hits a NUL terminator. If you run out of
data without hitting a null terminator, that's not the same thing and maybe you
want to refill your buffer instead of prematurely ending the string. (Yes, I
thought about it.) If you don't want to track the length and know you're working
on a null terminated string, feeding in length 4 works because that's the
longest utf8 sequence unicode allows, due to the maximum possible value being
truncated BY MICROSOFT so it doesn't outshine their horrible legacy format:

https://stackoverflow.com/questions/27415935/does-unicode-have-a-defined-maximum-number-of-code-points

Knowing your string is NUL terminated means it can't read off the end, because
any value <128 will terminate a unicode sequence, with error as necessary. (The
first byte indicates how many following bytes to expect, but every byte of a
multibyte utf8 sequence has the high bit set, and everything after the first
has 10 as its top two bits. An unexpected NUL terminator is therefore an
encoding error returning -1, so it can't read past the end of a null terminated
string no matter how long you say the length is. This also lets you traverse
utf8 sequences backwards by skipping anything with 10 in the top two bits:
everything else is a codepoint start or ascii value.)

Rob

