[Toybox] utf8 (was Re: musl intentionally broke chrt)

Rob Landley rob at landley.net
Fri Sep 1 15:05:24 PDT 2017


On 09/01/2017 10:45 AM, enh wrote:
> http://www.unicode.org/faq/utf_bom.html#utf8-4

Horrible windows thing that tries to store unicode in shorts instead of
longs for backwards compatibility with the assumption there couldn't
possibly be more than 65535 letters in the world. Yeah, I figured that.
What I can't figure out is why we'd exclude "you leaked utf-16 to the
outside world" from a Linux translation of utf8, and that... still
doesn't explain it? From the linked page:

> CESU-8 is... designed and recommended for use only within products
> requiring this UTF-16 binary collation equivalence. It is not intended
> nor recommended for open interchange.

Oh well.

Meanwhile, http://www.unicode.org/faq/utf_bom.html#utf16-6 says 1114111
which is 0x10ffff, so there's the source of that limit. (Looks like the
standards bodies swore a blood oath not to break windows. I'm assuming
the check they cashed also had a carefully measured number of zeroes.)
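
For anyone wanting to check that arithmetic, here is a minimal sketch in Python of where the number comes from: the largest value a UTF-16 surrogate pair can encode. The function name from_surrogates is made up for illustration; the range constants are the standard surrogate ranges.

```python
# Standard UTF-16 surrogate ranges.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF


def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    # Each surrogate carries 10 payload bits; pairs start at 0x10000.
    return 0x10000 + ((high - HIGH_FIRST) << 10) + (low - LOW_FIRST)


# The highest pair (dbff, dfff) gives the top of the range.
max_code_point = from_surrogates(HIGH_LAST, LOW_LAST)
print(hex(max_code_point), max_code_point)  # prints: 0x10ffff 1114111
```

So the cap is just 0x10000 plus 20 bits of surrogate payload, minus one.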

Meanwhile, glibc, musl, and bionic all translate stuff differently, and
none of them quite agree with each other:

Musl is matching my output except they cap at 0x11ffff instead of
0x10ffff. (I poked Rich and he said that's a bug and he'll fix it.)

Bionic is A) going up to 0x1fffff, B) refusing to translate efbfbe as
fffe and efbfbf as ffff (it says both are invalid sequences).

Glibc is also capping the output at 0x1fffff, but on top of that it says
sequences like fe808080 are -2 not -1. (The one in ubuntu 14.04 is
anyway, who knows what the current version's doing...)
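
To make the differences above concrete, here is a rough sketch in Python of a single-sequence decoder with the disputed parts turned into knobs. This is not the toybox code or any libc's implementation: decode_utf8, cap, and allow_surrogates are names invented for this example, and the -1 (invalid) / -2 (truncated) return values are just an assumed convention borrowed from the numbers quoted above.

```python
def decode_utf8(data, cap=0x10FFFF, allow_surrogates=False):
    """Decode one sequence from the start of data (illustrative sketch).

    Returns (code_point, bytes_used), -1 if invalid, -2 if truncated.
    """
    if not data:
        return -2
    first = data[0]
    if first < 0x80:
        return first, 1
    # (lead mask, lead value, total length, smallest code point of that length)
    for mask, value, length, minimum in (
        (0xE0, 0xC0, 2, 0x80),
        (0xF0, 0xE0, 3, 0x800),
        (0xF8, 0xF0, 4, 0x10000),
    ):
        if first & mask == value:
            break
    else:
        return -1  # stray continuation byte, or an f8-ff lead such as fe
    code_point = first & ~mask & 0xFF  # payload bits of the lead byte
    for i in range(1, length):
        if i >= len(data):
            return -2  # ran out of input mid-sequence
        if data[i] & 0xC0 != 0x80:
            return -1  # not a continuation byte
        code_point = (code_point << 6) | (data[i] & 0x3F)
    if code_point < minimum:
        return -1  # overlong encoding
    if code_point > cap:
        return -1  # above the configured limit
    if not allow_surrogates and 0xD800 <= code_point <= 0xDFFF:
        return -1  # the d800-dfff surrogate range
    return code_point, length


print(decode_utf8(bytes.fromhex("efbfbe")))    # (65534, 3), i.e. fffe
print(decode_utf8(bytes.fromhex("f4908080")))  # -1 with the 0x10ffff cap
print(decode_utf8(bytes.fromhex("fe808080")))  # -1, fe is never a lead byte
```

Passing cap=0x1fffff accepts f4908080 as 0x110000, which is the sort of disagreement described above; which answer is "right" for each library is not something this sketch settles.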

Rob

P.S. I can't be the first person to test this stuff, can I?

P.P.S. would creating/parsing an intentionally overlong coding for the
d800-dfff "blood oath to windows" space be cheating?

