[Toybox] unicode.c question.

enh enh at google.com
Mon May 17 17:08:33 PDT 2021


seems reasonable to me, though i'd be tempted to specifically mention utf8
--- there's nothing inherently "invalid" about surrogate pairs, other than
that you shouldn't see them in utf8. (though you will see them in
_modified_ utf8, so Java programmers might still meet them.)

anything above 0x10ffff seems legitimately just plain "invalid" though.
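
as a rough sketch of that distinction (classify() is a made-up helper for illustration, not anything in toybox, and the label strings are my own assumption), the two cases could be reported separately rather than both as "invalid":

```c
#include <stdio.h>

// Hypothetical helper, not part of toybox: label a code point using the
// same two range checks as the proposed patch, but kept apart.
static const char *classify(unsigned wc)
{
  // Nothing above U+10FFFF is a Unicode code point at all.
  if (wc > 0x10ffff) return "invalid";

  // U+D800..U+DFFF are surrogates: fine as UTF-16 code units (and in
  // modified utf8), but they should not appear in utf8.
  if (wc > 0xd7ff && wc < 0xe000) return "invalid in utf8 (surrogate)";

  return "ok";
}
```

so classify(0xd800) gives the surrogate message, classify(0x110000) gives plain "invalid", and the boundary values 0xd7ff, 0xe000 and 0x10ffff all come back "ok".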

On Sat, May 15, 2021 at 10:21 AM Rob Landley <rob at landley.net> wrote:

> Elliott, is it worth testing for invalid unicode range in the display, ala:
>
> --- a/toys/other/ascii.c
> +++ b/toys/other/ascii.c
> @@ -44,7 +44,8 @@ static void codepoint(unsigned wc)
>    char *s = toybuf + sprintf(toybuf, "U+%04X : ", wc), *ss;
>    unsigned n, i;
>
> -  if (wc>31 && wc!=127) {
> +  if ((wc>0xd7ff && wc<0xe000) || wc>0x10ffff) s += sprintf(s, "invalid");
> +  else if (wc>31 && wc!=127) {
>      s += n = wctoutf8(ss = s, wc);
>      if (n>1) for (i = 0; i<n; i++) s += sprintf(s, " : %#02x"+2*!!i, *ss++);
>    } else s = memcpy(s, (wc==127) ? "DEL" : low+wc*3, 3)+3;
>
>
> Rob
>