[Toybox] utf8towc(), stop being defective on null bytes

Oliver Webb aquahobbyist at proton.me
Sun Apr 7 07:42:45 PDT 2024


On Sunday, April 7th, 2024 at 03:54, Rob Landley <rob at landley.net> wrote:

> As for moving it again someday, unnecessarily moving files is churn that makes
> the history harder to see, and lib/*.c has never been a strict division (more
> "one giant file seems a bit much"). The basic conversion to/from utf8 is
> different from caring about the characteristics of unicode code points (which
> the rest of utf8.c does), so having it in lib.c makes a certain amount of sense,
> and I'm not strongly motivated to change it without a good reason.
> 
> It might happen eventually because I'm still not happy with the general unicode
> handling design "yet", but that's a larger story.

Eh, they're utf8 functions, and utf8 functions living in the file named "utf8.c" makes
more sense from my perspective.

I was also planning on doing some form of documentation write-up in code.html
about, among other things, the utf8 functions. That stopped when I realized
it would mean documenting all of the eighty-something functions in lib.c.

> (I probably should have called it unicode.c instead, but
> unicode is icky, the name is longer, and half the unicode stuff is still in libc
> anyway).
> 
> Unicode is icky because utf8 and unicode are not the same thing.

If it's handling unicode instead of utf8 and the two are noticeably different,
I don't see why a file for unicode stuff should be called utf8.c.

> Because Microsoft broke utf8 in multiple ways through the unicode consortium,
> among other things making 4 bytes the max:

I have to ask: if you disagree with the decision to cap utf8 at only a million codepoints,
and not complying with it only means that anyone who wants to pass unicode codepoints over
U+10FFFF to toybox code will be able to, why have code that makes sure we comply with an
insane Microsoft decision when we don't (I don't think?) have to:

  // Limit unicode so it can't encode anything UTF-16 can't.
  if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return -1;
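
To make the effect of that check concrete, here is a minimal standalone sketch of
the same condition as a predicate. The name utf16_can_encode() is made up for
illustration and is not a toybox function.

```c
#include <stdbool.h>

// Hypothetical helper (not in toybox): true if a code point passes the check
// quoted above, i.e. it is at most 0x10ffff and not in the surrogate range.
bool utf16_can_encode(unsigned cp)
{
  return !(cp>0x10ffff || (cp>=0xd800 && cp<=0xdfff));
}
```

With this, 0x10ffff and 0xe000 pass, while 0x110000 and everything from 0xd800
through 0xdfff are rejected.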

> > Another thing I noticed is that if you pass a null byte into utf8towc(), it will
> > assign, but will not "return bytes read" like it's supposed to, instead it will
> > return 0 when it reads 1 byte.
> 
> The same way strlen() doesn't include the null terminator in the length "like
> it's supposed to"? Obviously stpcpy() is defective to write a null terminator
> and then return a pointer to that null terminator, instead of returning the
> first byte it didn't modify "like it's supposed to"...
> 
> An assertion is not the same as a question.

If I'm going by the comment over the function body ("This returns bytes read unless error"),
then yes, that is what "it's supposed to do": we have read one byte of input and written it
successfully to our return destination. A special case for null bytes is fine, but to save
me and anyone else that debugging nightmare when they try to do utf8 processing on data
with null bytes in it, I'd prefer it was mentioned somewhere.

A bug only becomes a feature when you declare it is one, and "undocumented special case"
is another way to say "landmine".

> Returning length 0 means we hit a null terminator,

Null bytes aren't always "terminators". You can embed null bytes into data and still
want to do utf8 processing with it.
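
A rough sketch of what a caller has to do today when the data has a known length
and may contain null bytes. Everything here is an assumption for illustration:
demo_utf8towc() is a simplified stand-in that only mimics the return convention
under discussion (bytes read, 0 for a null byte, -1 on error). It is not toybox's
utf8towc() and skips the overlong and range checks. demo_count_codepoints() is
likewise a made-up name.

```c
#include <stddef.h>

// Hypothetical stand-in, NOT toybox's code: returns bytes read, 0 for a NUL
// byte, or -1 on a malformed or truncated sequence. No overlong/range checks.
int demo_utf8towc(unsigned *wc, const char *str, size_t len)
{
  const unsigned char *s = (const unsigned char *)str;
  unsigned result;
  int need, i;

  if (!len) return -1;
  if (*s<0x80) return !!(*wc = *s);  // NUL is stored, but reported as length 0
  if (*s>=0xc2 && *s<=0xdf) need = 1, result = *s&0x1f;
  else if ((*s&0xf0)==0xe0) need = 2, result = *s&0x0f;
  else if (*s>=0xf0 && *s<=0xf4) need = 3, result = *s&0x07;
  else return -1;
  if ((size_t)need>=len) return -1;
  for (i = 1; i<=need; i++) {
    if ((s[i]&0xc0)!=0x80) return -1;
    result = (result<<6)|(s[i]&0x3f);
  }
  *wc = result;

  return need+1;
}

// Count code points in a buffer of known length that may contain NUL bytes.
// Returns -1 on the first decoding error.
long demo_count_codepoints(const char *buf, size_t len)
{
  long count = 0;
  unsigned wc;

  while (len) {
    int got = demo_utf8towc(&wc, buf, len);

    if (got<0) return -1;
    if (!got) got = 1;  // the special case: NUL consumed one byte, returned 0
    buf += got;
    len -= got;
    count++;
  }

  return count;
}
```

Without the "if (!got) got = 1;" line the loop never advances past the first null
byte, which is the kind of surprise a note in the function's comment would prevent.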

> due to the maximum possible value being truncated BY MICROSOFT so it doesn't outshine their horrible legacy format:

"BY MICROSOFT", and by you. https://github.com/landley/toybox/blob/master/lib/lib.c#L189.
Do we need to do that for any reason other than to comply with Microsoft and the Unicode committee?
The Linux kernel is agnostic to filenames having "good utf8". Should utf8towc (I don't think
wctoutf8 has this restriction) be agnostic towards "good unicode" when it's utf8 we are processing,
and delegate that job to the fontmetrics code? Again, it's utf8 we are handling with these,
not unicode, even if the two are linked.
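
As a sketch of what "agnostic" could look like, here is a made-up decoder for
4-byte sequences only, where the range limit is a policy the caller opts into
rather than something baked into the decode. The name demo_decode4() and the
strict flag are assumptions for illustration, not anything in toybox.

```c
// Hypothetical sketch (not toybox code): decode one 4-byte utf8 sequence.
// With strict set, reject values above 0x10ffff; without it, return whatever
// 21-bit value the bytes carry. Returns 4 on success, -1 on error.
int demo_decode4(unsigned *wc, const unsigned char *s, int strict)
{
  unsigned result;
  int i;

  if ((s[0]&0xf8)!=0xf0) return -1;  // lead byte must be 11110xxx
  result = s[0]&7;
  for (i = 1; i<4; i++) {
    if ((s[i]&0xc0)!=0x80) return -1;  // continuation must be 10xxxxxx
    result = (result<<6)|(s[i]&0x3f);
  }
  if (result<0x10000) return -1;  // overlong encoding
  if (strict && result>0x10ffff) return -1;
  *wc = result;

  return 4;
}
```

The bytes f4 90 80 80 encode 0x110000, so they decode with strict off and fail
with strict on. The surrogate range needs no test here because a non-overlong
4-byte sequence is always at least 0x10000.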

"And even then it might be the wrong thing to disallow clever
people from doing clever things. Encoding other information in filenames
might be proper for a number of applications."
- Linus Torvalds, https://yarchive.net/comp/linux/utf8.html

-   Oliver Webb <aquahobbyist at proton.me>


