[Toybox] Does anyone here understand how unicode combining characters work?

Rich Felker dalias at libc.org
Wed Sep 26 12:39:06 PDT 2018


On Wed, Sep 26, 2018 at 12:21:46PM -0700, enh wrote:
> in general ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt is
> pretty useful too. iirc plan9 had a code point lookup tool, but
> honestly i mainly type U+xxxx into Google and end up at
> https://www.fileformat.info/info/unicode/char/2028/index.htm.
> 
> the wcwidth stuff isn't well defined (in that it's not a Unicode
> notion, and is under-specified by POSIX) but Unicode does have the

This is true; it's only defined by convention between implementations
and terminal emulators, and without their agreement, everything
breaks.

> "east asian width" data. see
> ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt for that.
> 
> the Unicode FAQs are often helpful too.
> http://unicode.org/faq/char_combmark.html
> 
> plus the full standard is freely available:
> http://www.unicode.org/versions/Unicode11.0.0/

Generally, implementations agree that characters with East Asian Width
property F (fullwidth) or W (wide) are wcwidth==2, and that characters
in general categories Mn or Me (nonspacing or enclosing combining
marks) are wcwidth==0. There are also a number of category Cf (format)
characters that need to be treated as wcwidth==0 for the associated
languages to work on a terminal.
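
As a rough illustration of that convention, here is a minimal sketch
using Python's standard unicodedata module. The names approx_wcwidth
and approx_wcswidth are hypothetical, not taken from any libc. Two
simplifications are assumed: every Cf character is treated as zero
width (real implementations pick only some of them), and control or
unassigned characters are not given the -1 that wcwidth() would
return. The results also depend on the Unicode version bundled with
the Python build.

```python
import unicodedata

# General categories treated as zero width in this sketch.
# Assumption: all of Cf is included, which is broader than what
# real wcwidth() implementations necessarily do.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}

def approx_wcwidth(ch):
    """Approximate terminal column width of a single code point."""
    if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
        return 0
    # East Asian Width F (fullwidth) or W (wide) takes two columns.
    if unicodedata.east_asian_width(ch) in ("F", "W"):
        return 2
    # Everything else is assumed to take one column; control and
    # unassigned characters are deliberately not special-cased.
    return 1

def approx_wcswidth(s):
    """Sum of per-code-point widths for a string."""
    return sum(approx_wcwidth(ch) for ch in s)
```

For example, approx_wcswidth("e\u0301") counts the base letter as one
column and the combining acute accent (category Mn) as zero.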

Rich


