[Toybox] utf8 (was Re: musl intentionally broke chrt)

enh enh at google.com
Fri Jul 6 17:23:57 PDT 2018


On Fri, Sep 1, 2017 at 3:05 PM Rob Landley <rob at landley.net> wrote:
>
> On 09/01/2017 10:45 AM, enh wrote:
> > http://www.unicode.org/faq/utf_bom.html#utf8-4
>
> Horrible windows thing that tries to store unicode in shorts instead of
> longs for backwards compatibility with the assumption there couldn't
> possibly be more than 65535 letters in the world. Yeah, I figured that.
> What I can't figure out is why we'd exclude "you leaked utf-16 to the
> outside world" from a Linux translation of utf8, and that... still
> doesn't explain it? From the linked page:
>
> > CESU-8 is... designed and recommended for use only within products
> > requiring this UTF-16 binary collation equivalence. It is not intended
> > nor recommended for open interchange.
>
> Oh well.
>
> Meanwhile, http://www.unicode.org/faq/utf_bom.html#utf16-6 says 1114111
> which is 0x10ffff, so there's the source of that limit. (Looks like the
> standards bodies swore a blood oath not to break windows. I'm assuming
> the check they cashed also had a carefully measured number of zeroes.)
>
> Meanwhile, glibc, musl, and bionic all translate stuff differently, and
> none of them quite agree with each other:
>
> Musl is matching my output except they cap at 0x11ffff instead of
> 0x10ffff. (I poked Rich and he said that's a bug and he'll fix it.)
>
> Bionic is A) going up to 0x1fffff, B) refusing to translate efbfbe as
> fffe and efbfbf as ffff (it says both are invalid sequences).

finally got around to fixing this:
https://android-review.googlesource.com/c/platform/bionic/+/714149
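
For anyone wanting to poke at the cases above, here is a minimal sketch of a
strict decoder. It is not the code from the linked change, and decode_one is
a hypothetical name. It assumes the usual rules: cap at 0x10ffff, reject
overlong forms and d800-dfff, but accept efbfbe and efbfbf as fffe and ffff.
The -1 (invalid) and -2 (truncated) return values imitate the mbrtowc()-style
distinction that comes up in this thread.

```python
def decode_one(buf: bytes) -> int:
    """Decode the UTF-8 sequence at the start of buf and return its code point.

    Returns -1 for an invalid sequence and -2 for a truncated one.
    Simplification: a truncated sequence is reported as -2 even if its
    prefix could never be completed into a valid sequence.
    """
    if not buf:
        return -2
    b0 = buf[0]
    if b0 < 0x80:                      # plain ASCII
        return b0
    if 0xC2 <= b0 <= 0xDF:             # 2-byte lead (c0/c1 are always overlong)
        extra, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:           # 3-byte lead
        extra, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:           # 4-byte lead (f5 and up exceed 0x10ffff)
        extra, cp, lowest = 3, b0 & 0x07, 0x10000
    else:                              # stray continuation byte, or f5..ff
        return -1
    for i in range(1, extra + 1):
        if i >= len(buf):
            return -2                  # ran out of input mid-sequence
        if buf[i] & 0xC0 != 0x80:
            return -1                  # not a continuation byte
        cp = (cp << 6) | (buf[i] & 0x3F)
    if cp < lowest:                    # overlong encoding
        return -1
    if cp > 0x10FFFF:                  # above the Unicode range
        return -1
    if 0xD800 <= cp <= 0xDFFF:         # UTF-16 surrogate space
        return -1
    return cp


for hexstr in ("efbfbe", "efbfbf", "f48fbfbf", "f4908080", "fe808080", "e282"):
    print(hexstr, decode_one(bytes.fromhex(hexstr)))
```

With these rules fe808080 comes out as -1 rather than -2, because fe can
never start a valid sequence.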

> Glibc is also capping the output at 0x1fffff, but on top of that it says
> sequences like fe808080 are -2 not -1. (The one in ubuntu 14.04 is
> anyway, who knows what the current version's doing...)
>
> Rob
>
> P.S. I can't be the first person to test this stuff, can I?
>
> P.P.S. would creating/parsing an intentionally overlong coding for the
> d800-dfff "blood oath to windows" space be cheating?
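
On where 1114111 comes from and the P.P.S. question, a small sketch may help.
The helper names from_surrogates and is_valid_utf8 are made up for
illustration; the only library behaviour relied on is Python's built-in
strict UTF-8 codec. The largest value a UTF-16 surrogate pair can express is
exactly 0x10ffff, and a strict decoder rejects both the normal 3-byte and an
overlong 4-byte encoding of d800.

```python
def from_surrogates(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts raw."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The largest pair, dbff dfff, gives the 0x10ffff ceiling.
max_cp = from_surrogates(0xDBFF, 0xDFFF)
print(hex(max_cp), max_cp)

# d800 as an ordinary 3-byte sequence, and as an overlong 4-byte one.
print(is_valid_utf8(bytes.fromhex("eda080")))
print(is_valid_utf8(bytes.fromhex("f08da080")))
```

So with a strict decoder the overlong form does not get you into the
d800-dfff space either; it is rejected for being overlong before the
surrogate check even matters.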

