[Toybox] utf8towc(), stop being defective on null bytes

Oliver Webb aquahobbyist at proton.me
Sat Apr 6 15:48:20 PDT 2024


Heya, I've been looking more at the utf8 code in toybox. The first thing I spotted is that
utf8towc() and wctoutf8() are both in lib.c instead of utf8.c. Why haven't they
been moved yet? Is it easier to track code that way? Also, the documentation
(header comment) should probably mention that they store values as Unicode code points.
I spent a while scratching my head at the fact that wide characters are 4-byte ints
when the maximum length of a single utf8 character is 6 bytes.

Another thing I noticed is that if you pass a null byte into utf8towc(), it will
assign it, but will not return the number of bytes read like it's supposed to. Instead it
returns 0 even though it read 1 byte. This is because we collapse the return value for ascii
characters down to 1 _or 0_ with !!(*a = *b), whereas "|| 1" would always collapse the value to 1.
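
To make the difference concrete, here is a minimal sketch of just the ascii
case. The function names and signatures are made up for illustration and this
is not the toybox source; it only contrasts the two expressions quoted above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the ascii path only, not the toybox code. */

/* Behavior described above: 1 for a nonzero byte, 0 for a null byte. */
int ascii_len_current(unsigned *wc, const char *str)
{
  return !!(*wc = (unsigned char)*str);
}

/* With "|| 1": the assignment still happens, the result is always 1. */
int ascii_len_proposed(unsigned *wc, const char *str)
{
  return (*wc = (unsigned char)*str) || 1;
}

/* Small self-check contrasting the two on 'A' and on a null byte. */
void compare_ascii_returns(void)
{
  unsigned wc = 99;

  assert(ascii_len_current(&wc, "A") == 1 && wc == 'A');
  assert(ascii_len_current(&wc, "") == 0 && wc == 0);  /* stored, reports 0 */

  assert(ascii_len_proposed(&wc, "A") == 1 && wc == 'A');
  wc = 99;
  assert(ascii_len_proposed(&wc, "") == 1 && wc == 0); /* stored, reports 1 */
}
```

In both versions the null byte is stored; only the reported length differs.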

Suppose you have a function that turns a character string into an array of "wide characters".
This is easily done with a while loop that keeps one index into the old character string and
another into the new wide character string, so you should just be able to write
"while (ai < len) ai += utf8towc(...". The problem? If you hit a null byte, ai never advances
and the code goes into an infinite loop. This can be solved with a ternary operator or some
other check, but fixing utf8towc() to do the _right_ thing seems more sensible (we have read
one byte and written it successfully).
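
For reference, the loop and the ternary workaround look roughly like the sketch
below. decode_ascii() is a hypothetical ascii-only stand-in that mimics the
return value described above (0 on a null byte); it is not utf8towc() itself,
and to_wide() is an illustrative name, not an existing toybox function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: ascii only, returns 0 for a null byte. */
int decode_ascii(unsigned *wc, const char *str)
{
  return !!(*wc = (unsigned char)*str);
}

/* Convert len bytes of in to wide characters in out (out must have room
 * for len entries). Returns the number of wide characters written. */
size_t to_wide(unsigned *out, const char *in, size_t len)
{
  size_t ai = 0, wi = 0;

  while (ai < len) {
    int n = decode_ascii(&out[wi++], in + ai);

    /* Plain "ai += n" would add 0 on a null byte and never terminate.
     * The ternary forces progress; it would be unnecessary if the
     * decoder reported 1 byte consumed. */
    ai += n ? n : 1;
  }

  return wi;
}

/* Embedded null byte: the loop still terminates and keeps all 3 bytes. */
void check_to_wide(void)
{
  unsigned out[3];

  assert(to_wide(out, "a\0b", 3) == 3);
  assert(out[0] == 'a' && out[1] == 0 && out[2] == 'b');
}
```

Every caller that walks a buffer with embedded null bytes would need the same
check, which is the argument for fixing the return value instead.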

-   Oliver Webb <aquahobbyist at proton.me>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-utf8towc-return-1-on-null-byte-instead-of-0.patch
Type: text/x-patch
Size: 687 bytes
Desc: not available
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20240406/56239735/attachment.bin>

