[Toybox] utf8 (was Re: musl intentionally broke chrt)

Fri Sep 1 08:45:31 PDT 2017

http://www.unicode.org/faq/utf_bom.html#utf8-4

On Fri, Sep 1, 2017 at 1:08 AM, Rob Landley <rob at landley.net> wrote:
> On 08/31/2017 04:01 PM, enh wrote:
>>>> didn't you get into utf8 because of my wc -m patch? :-)
>>>
>>> Working on it. It's one of those "I'd like to do what I consider the
>>> _proper_ fix" things that's honestly been a bit of a luxury these days.
>>>
>>> I wrote a for loop to go from 0 to UINT_MAX, and I'm comparing the
>>> mbrtowc(&wc, str, 4, &mb) results to my contextless utf8towc(&wc, str,
>>> len) output, and I'm fixing every deviation between the two. I'm
>>> currently trying to figure out why 0xeda080 _isn't_ 0xd800. (glibc
>>> translates wc 0xd800 as f8a08a83 but it's less than ffff so
>>> https://en.wikipedia.org/wiki/UTF-8 says it should be 3 bytes and I'm
>>> CONFUSED...)
>>
>> U+d800 is a surrogate, so shouldn't be valid in utf8.
>
> Still dunno what a surrogate is but I read more of the wikipedia page
> and while utf8 is simple, unicode is insane.
>
> Now I'm up to f5 80 80 80 parsing 4 bytes to produce 0x140000 (according
> to glibc) but wikipedia[citation needed] says the last code point is
> 0x10ffff _and_ that 245 (f5) is never the first byte in a valid sequence.
>
> Meanwhile over on musl, f4 90 80 80 is parsing to 0x110000 and again,
> that's > 0x10ffff. And I tried the bionic ndk I have lying around
> (/opt/android/AndroidVersion.txt says 3.8.275480) and efbfbe is failing
> to be fffe.
>
> Rob
>
> P.S. Ongoing terrible test program attached. (When you run it on a
> little endian system you have to reverse the bytes in the output when it
> reports an error...)
>
> P.P.S. The unindented lines are debug lines, makes 'em easy to strip
> back out again.
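
For reference, here is a minimal sketch of the checks a strict decoder would
apply to the sequences quoted above. It assumes the rules from the Wikipedia
page and the Unicode FAQ: no surrogates (U+D800 to U+DFFF), nothing above
U+10FFFF, no overlong encodings, and no lead bytes c0, c1, or f5 and up.
The name strict_utf8towc() and its return convention are hypothetical. This
is not toybox's utf8towc(), nor what glibc, musl, or bionic actually do.

```c
#include <stddef.h>

// Hypothetical strict decoder, for illustration only.
// Returns bytes consumed (1-4), -1 for an invalid sequence,
// or -2 if the input ends before the sequence is complete.
int strict_utf8towc(unsigned *wc, const unsigned char *s, size_t len)
{
  // Smallest code point each sequence length may encode (rejects overlongs).
  static const unsigned min[] = {0, 0, 0x80, 0x800, 0x10000};
  unsigned c, n, i;

  if (!len) return -2;
  c = s[0];
  if (c < 0x80) {
    *wc = c;
    return 1;
  }

  // 80..bf are continuation bytes, c0 and c1 can only start overlong
  // sequences, and f5..ff would encode values past 0x10ffff.
  if (c < 0xc2 || c > 0xf4) return -1;

  n = (c >= 0xf0) ? 4 : (c >= 0xe0) ? 3 : 2;
  c &= 0xff >> (n + 1); // keep only the payload bits of the lead byte

  for (i = 1; i < n; i++) {
    if (i >= len) return -2;
    if ((s[i] & 0xc0) != 0x80) return -1;
    c = (c << 6) | (s[i] & 0x3f);
  }

  if (c < min[n]) return -1;                 // overlong encoding
  if (c >= 0xd800 && c <= 0xdfff) return -1; // surrogate
  if (c > 0x10ffff) return -1;               // past the last code point

  *wc = c;
  return n;
}
```

Under these assumptions ed a0 80 (which would be 0xd800), f4 90 80 80 (which
would be 0x110000), and f5 80 80 80 are all rejected, while ef bf be decodes
to 0xfffe in 3 bytes: a noncharacter, but not an encoding error.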


-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.