[Toybox] utf8 (was Re: musl intentionally broke chrt)

Rob Landley rob at landley.net
Fri Sep 1 01:08:42 PDT 2017


On 08/31/2017 04:01 PM, enh wrote:
>>> didn't you get in to utf8 because of my wc -m patch? :-)
>>
>> Working on it. It's one of those "I'd like to do what I consider the
>> _proper_ fix" things that's honestly been a bit of a luxury these days.
>>
>> I wrote a for loop to go from 0 to UINT_MAX, and I'm comparing the
>> mbrtowc(&wc, str, 4, &mb) results to my contextless utf8towc(&wc, str,
>> len) output, and I'm fixing every deviation between the two. I'm
>> currently trying to figure out why 0xeda080 _isn't_ 0xd800. (glibc
>> translates wc 0xd800 as f8a08a83 but it's less than ffff so
>> https://en.wikipedia.org/wiki/UTF-8 says it should be 3 bytes and I'm
>> CONFUSED...)
> 
> U+d800 is a surrogate, so shouldn't be valid in utf8.

Still dunno what a surrogate is, but I read more of the Wikipedia page
and while utf8 is simple, Unicode is insane.
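
For reference, the 0xeda080 puzzle from the quoted text can be worked through
by hand. The sketch below is not from the attached test program; raw3() and
is_surrogate() are hypothetical helper names. It applies the plain 3-byte
UTF-8 arithmetic with no validity checks, which shows that ed a0 80 does
come out as 0xd800 numerically, and that the result lands in the range
U+D800..U+DFFF reserved for UTF-16 surrogate pairs. A decoder that follows
the UTF-8 definition rejects that range, which would explain why the bytes
don't round-trip.

```c
#include <stdint.h>

// Hypothetical helper: plain 3-byte UTF-8 arithmetic, no validity checks.
// Layout is 1110xxxx 10xxxxxx 10xxxxxx, so keep 4 + 6 + 6 payload bits.
uint32_t raw3(unsigned char a, unsigned char b, unsigned char c)
{
  return ((uint32_t)(a & 0x0f) << 12) | ((uint32_t)(b & 0x3f) << 6)
    | (uint32_t)(c & 0x3f);
}

// Hypothetical helper: U+D800..U+DFFF are surrogates, used only by UTF-16
// to encode code points above 0xffff as pairs of 16-bit units.
int is_surrogate(uint32_t wc)
{
  return wc >= 0xd800 && wc <= 0xdfff;
}
```

So raw3(0xed, 0xa0, 0x80) gives 0xd800 and is_surrogate() flags it, while
raw3(0xef, 0xbf, 0xbe) gives 0xfffe, which is outside the surrogate range.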

Now I'm up to f5 80 80 80 parsing 4 bytes to produce 0x140000 (according
to glibc) but Wikipedia[citation needed] says the last code point is
0x10ffff _and_ that 245 (f5) is never the first byte in a valid sequence.

Meanwhile over on musl, f4 90 80 80 is parsing to 0x110000 and again,
that's > 0x10ffff. And I tried the bionic ndk I have lying around
(/opt/android/AndroidVersion.txt says 3.8.275480) and efbfbe is failing
to be fffe.
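
The cases above can be checked against a strict decoder. What follows is a
minimal sketch under stated assumptions, not toybox's utf8towc() and not
what any of the three libcs actually do: strict_utf8() is a hypothetical
name, and the rules it enforces are the ones the Wikipedia page describes
(no overlong encodings, no surrogates, nothing above 0x10ffff). It returns
the number of bytes consumed, or -1 for an invalid or truncated sequence.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical strict decoder: decode one UTF-8 sequence from str (at most
// len bytes) into *wc. Returns bytes consumed, or -1 if invalid/truncated.
int strict_utf8(uint32_t *wc, const unsigned char *str, size_t len)
{
  // Smallest code point allowed per sequence length (rejects overlongs).
  static const uint32_t min[] = {0, 0, 0x80, 0x800, 0x10000};
  uint32_t result;
  int count, i;

  if (!len) return -1;
  if (*str < 0x80) {
    *wc = *str;
    return 1;
  }

  // The lead byte's high bits give the sequence length.
  if ((*str & 0xe0) == 0xc0) count = 2;
  else if ((*str & 0xf0) == 0xe0) count = 3;
  else if ((*str & 0xf8) == 0xf0) count = 4;
  else return -1; // stray continuation byte, or f8..ff
  if (len < (size_t)count) return -1;

  // Payload bits of the lead byte, then 6 bits per continuation byte.
  result = *str & (0xff >> (count + 1));
  for (i = 1; i < count; i++) {
    if ((str[i] & 0xc0) != 0x80) return -1;
    result = (result << 6) | (str[i] & 0x3f);
  }

  if (result < min[count]) return -1;                  // overlong
  if (result >= 0xd800 && result <= 0xdfff) return -1; // surrogate
  if (result > 0x10ffff) return -1;                    // past last code point

  *wc = result;
  return count;
}
```

With these rules f5 80 80 80 (0x140000) and f4 90 80 80 (0x110000) are both
rejected by the range check, ed a0 80 is rejected as a surrogate, and
ef bf be decodes to 0xfffe (a noncharacter, but still encodable, so this
sketch accepts it). The largest accepted input is f4 8f bf bf, which is
0x10ffff.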

Rob

P.S. Ongoing terrible test program attached. (When you run it on a
little-endian system you have to reverse the bytes in the output when it
reports an error...)

P.P.S. The unindented lines are debug lines, makes 'em easy to strip
back out again.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test2.c
Type: text/x-csrc
Size: 1632 bytes
Desc: not available
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20170901/7fe7e7c3/attachment.c>

