[Toybox] utf8towc(), stop being defective on null bytes

Mon Apr 8 09:01:43 PDT 2024

On Sun, Apr 7, 2024 at 7:43 AM Oliver Webb via Toybox
<toybox at lists.landley.net> wrote:
>
> On Sunday, April 7th, 2024 at 03:54, Rob Landley <rob at landley.net> wrote:
>
> > As for moving it again someday, unnecessarily moving files is churn that makes
> > the history harder to see, and lib/*.c has never been a strict division (more
> > "one giant file seems a bit much"). The basic conversion to/from utf8 is
> > different from caring about the characteristics of unicode code points (which
> > the rest of utf8.c does), so having it in lib.c makes a certain amount of sense,
> > and I'm not strongly motivated to change it without a good reason.
> >
> > It might happen eventually because I'm still not happy with the general unicode
> > handling design "yet", but that's a larger story.
>
> Eh, they're utf8 functions, utf8 functions being in the file named "utf8.c" makes
> more sense from my perspective.
>
> I was also planning on doing some form of a documentation write up in code.html
> about, among other things, the utf8 functions. That stopped when I realized
> that would mean documenting all of the eighty-something functions in lib.c.
>
> > (I probably should have called it unicode.c instead, but
> > unicode is icky, the name is longer, and half the unicode stuff is still in libc
> > anyway).
> >
> > Unicode is icky because utf8 and unicode are not the same thing.
>
> If it's handling unicode instead of utf8 and the 2 are noticeably different,
> I don't see why a file for unicode stuff should be called utf8.c.
>
> > Because Microsoft broke utf8 in multiple ways through the unicode consortium,
> > among other things making 4 bytes the max:
>
> I have to ask: if you disagree with the decision to cap utf8 at about a million codepoints,
> and not complying with it only means that anyone who wants to pass unicode codepoints over
> U+10FFFF to toybox code can do so, why have code that makes sure we comply with an insane
> microsoft decision when we don't (I don't think?) have to:
>
>   // Limit unicode so it can't encode anything UTF-16 can't.
>   if (result>0x10ffff || (result>=0xd800 && result<=0xdfff)) return -1;
>
> > > Another thing I noticed is that if you pass a null byte into utf8towc(), it will
> > > assign, but will not "return bytes read" like it's supposed to, instead it will
> > > return 0 when it reads 1 byte.
> >
> > The same way strlen() doesn't include the null terminator in the length "like
> > it's supposed to"? Obviously stpcpy() is defective to write a null terminator
> > and then return a pointer to that null terminator, instead of returning the
> > first byte it didn't modify "like it's supposed to"...
> >
> > An assertion is not the same as a question.
>
> If I'm going by the comment over the function body ("This returns bytes read unless error"),
> then yes, that is what "it's supposed to do": we have read one byte of input and written it
> successfully to our return destination. A special case for null bytes is fine, but to save
> me and anyone else that debugging nightmare when they try to do utf8 processing on data
> with null bytes in it, I'd prefer it were mentioned somewhere.
>
> A bug only becomes a feature when you declare it is, and "undocumented special case"
> is another way to say "landmine".
>
> > Returning length 0 means we hit a null terminator,
>
> Null bytes aren't always "terminators". You can embed null bytes into data and still
> want to do utf8 processing with it.

that's questionable ... the desire to have ASCII NUL in utf-8
sequences (without breaking the "utf-8 sequences are usable as c
strings" property) is the main reason for the existence of "modified
utf-8".
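
to make that concrete, here is a minimal sketch of the trick, not toybox
code: mutf8_encode() is a hypothetical helper written for this example.
it encodes U+0000 as the overlong two-byte sequence C0 80, so the encoded
buffer never contains a 0x00 byte and still works with strlen(). it only
covers U+0000..U+FFFF; real modified utf-8 (as used by java) also encodes
supplementary characters as surrogate pairs, which this sketch leaves out.

```c
#include <string.h>

// Hypothetical helper (an assumption for illustration, not part of toybox).
// Writes cp to out in "modified UTF-8" style and returns bytes written,
// or -1 if cp is outside the range this sketch handles.
static int mutf8_encode(unsigned cp, unsigned char *out)
{
  // Plain ASCII, except NUL, which falls through to the two-byte form.
  if (cp && cp < 0x80) {
    out[0] = cp;
    return 1;
  }
  // Two bytes. For cp == 0 this produces C0 80, an overlong encoding
  // that strict UTF-8 decoders reject but that contains no 0x00 byte.
  if (cp < 0x800) {
    out[0] = 0xc0 | (cp >> 6);
    out[1] = 0x80 | (cp & 0x3f);
    return 2;
  }
  // Three bytes, covering the rest of the basic multilingual plane.
  if (cp < 0x10000) {
    out[0] = 0xe0 | (cp >> 12);
    out[1] = 0x80 | ((cp >> 6) & 0x3f);
    out[2] = 0x80 | (cp & 0x3f);
    return 3;
  }
  return -1;
}
```

encoding 'a', U+0000, 'b' with this helper gives the four bytes 61 C0 80
62, and a terminator added afterwards is the only 0x00 in the buffer. the
cost is that the output is no longer strictly valid utf-8.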

> > due to the maximum possible value being truncated BY MICROSOFT so it doesn't outshine their horrible legacy format:
>
> "BY MICROSOFT", and by you. https://github.com/landley/toybox/blob/master/lib/lib.c#L189.
> Do we need to do that for any reason other than to comply with microsoft and the unicode committee?
> The linux kernel is agnostic to filenames having "good utf8". Should utf8towc (I don't think
> wctoutf8 has this restriction) be agnostic towards "good unicode" when it's utf8 we are processing,
> and delegate that job to the fontmetrics code? Again, it's utf8 we are handling with these,
> not unicode, even if the 2 are linked.
>
> "And even then it might be the wrong thing to disallow clever
> people from doing clever things. Encoding other information in filenames
> might be proper for a number of applications."
> - Linus Torvalds, https://yarchive.net/comp/linux/utf8.html
>
> -   Oliver Webb <aquahobbyist at proton.me>