[Toybox] Does anyone here understand how unicode combining characters work?

enh enh at google.com
Wed Sep 26 13:00:06 PDT 2018


if anyone's interested, here's how bionic translates from the actual
unicode properties to implement wcwidth:
https://android.googlesource.com/platform/bionic/+/master/libc/bionic/wcwidth.cpp

(we do this in general so that we can outsource all the actual
unicode data to icu4c, and thereby guarantee consistency for
C/C++/Java regardless of which API is actually called.)
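
for a rough idea of what that kind of property-to-width mapping looks
like, here's a minimal python sketch. it is not bionic's code: python's
unicodedata module stands in for icu4c, and approx_wcwidth is a made-up
name. it assumes the usual convention (East Asian Width F or W is 2
columns; general categories Mn, Me and Cf are 0; control characters are
-1; everything else is 1). real implementations carry more special cases.

```python
import unicodedata


def approx_wcwidth(ch: str) -> int:
    """Approximate terminal column width of a single code point.

    Hypothetical helper for illustration only; the rules below are the
    common convention, not any particular libc's exact behavior.
    """
    if ord(ch) == 0:
        return 0  # NUL occupies no columns
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # other C0/C1 control characters are "non-printable"
    if category in ("Mn", "Me", "Cf"):
        return 0  # nonspacing marks, enclosing marks, format characters
    if unicodedata.east_asian_width(ch) in ("F", "W"):
        return 2  # fullwidth or wide
    return 1
```

e.g. approx_wcwidth("\u0301") (combining acute accent) gives 0 and
approx_wcwidth("\u4e2d") (a CJK ideograph) gives 2.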

On Wed, Sep 26, 2018 at 12:39 PM Rich Felker <dalias at libc.org> wrote:
>
> On Wed, Sep 26, 2018 at 12:21:46PM -0700, enh wrote:
> > in general ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt is
> > pretty useful too. iirc plan9 had a code point lookup tool, but
> > honestly i mainly type U+xxxx into Google and end up at
> > https://www.fileformat.info/info/unicode/char/2028/index.htm.
> >
> > the wcwidth stuff isn't well defined (in that it's not a Unicode
> > notion, and is under-specified by POSIX) but Unicode does have the
>
> This is true; it's only defined by convention between implementations
> and terminal emulators, and without their agreement, everything
> breaks.
>
> > "east asian width" data. see
> > ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt for that.
> >
> > the Unicode FAQs are often helpful too.
> > http://unicode.org/faq/char_combmark.html
> >
> > plus the full standard is freely available:
> > http://www.unicode.org/versions/Unicode11.0.0/
>
> Generally, implementations agree that characters with East Asian Width
> property full or wide are wcwidth==2, and character classes Mn or Me
> (nonspacing or enclosing combining) are wcwidth==0. There are also a
> number of class Cf characters that need to be treated as wcwidth==0
> for the associated languages to work on a terminal.
>
> Rich


