[Toybox] Does anyone here understand how unicode combining characters work?

enh enh at google.com
Wed Sep 26 12:21:46 PDT 2018


in general ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt is
pretty useful too. iirc plan9 had a code point lookup tool, but
honestly i mainly type U+xxxx into Google and end up at
https://www.fileformat.info/info/unicode/char/2028/index.htm.
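
if you'd rather have a local "unicode hexdump" than a web search, decoding the
utf-8 by hand is short. this is only a minimal sketch, not anything from toybox:
utf8_decode() and dump_codepoints() are names made up for this example. it
prints one U+XXXX line per code point, which you can then look up in
UnicodeData.txt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence from s (at most n bytes) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is
 * malformed, truncated, overlong, or a surrogate. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c, min;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (s[0] < 0x80) {              /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                   /* stray continuation or invalid lead byte */
    }
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Print "U+XXXX" for each code point in the n bytes at s.
 * Stops at the first bad sequence; returns how many were printed. */
size_t dump_codepoints(const char *s, size_t n, FILE *out)
{
    size_t off = 0, count = 0;

    while (off < n) {
        unsigned long cp;
        size_t used = utf8_decode((const unsigned char *)s + off, n - off, &cp);

        if (!used)
            break;
        fprintf(out, "U+%04lX\n", cp);
        off += used;
        count++;
    }
    return count;
}
```

for the bytes in the examples quoted below,
dump_codepoints("\xcc\xb4\xcc\x97\xcc\xa0", 6, stdout) prints U+0334, U+0317
and U+0320, all in the combining diacritical marks block. (if you append a
letter like 'e' in C source, split the string literal first, because "\xa0e"
parses as a single hex escape.)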

the wcwidth stuff isn't well defined (in that it's not a Unicode
notion, and is under-specified by POSIX) but Unicode does have the
"east asian width" data. see
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt for that.
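
for measuring output, one common approach is to decode with mbrtowc() and sum
wcwidth(), so zero-width combining characters add nothing to the column count.
a hedged sketch only: display_columns() and init_locale() are hypothetical
names, the answer depends on your libc's wcwidth tables and on the current
LC_CTYPE locale, and anything wcwidth() reports as nonprintable (tabs, the ESC
that starts an ansi sequence) just makes it return -1 rather than being handled.

```c
#define _XOPEN_SOURCE 700 /* must precede the includes: glibc hides wcwidth() otherwise */
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Adopt the environment's character encoding; returns 0 on failure. */
int init_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Columns needed to show the n bytes at s, or -1 for an invalid or
 * truncated sequence or a character wcwidth() calls nonprintable. */
int display_columns(const char *s, size_t n)
{
    mbstate_t st;
    int cols = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    while (n > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, s, n, &st);
        int w;

        if (used == (size_t)-1 || used == (size_t)-2)
            return -1;
        if (used == 0)
            used = 1; /* a NUL byte: mbrtowc() reports 0, but it consumed one byte */
        w = wcwidth(wc);
        if (w < 0)
            return -1;
        cols += w; /* combining characters have width 0, so they add nothing */
        s += used;
        n -= used;
    }
    return cols;
}
```

call init_locale() once at startup. in a utf-8 locale, an 'l' followed by three
combining marks and then an 'e' should come out as 2 columns.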

the Unicode FAQs are often helpful too.
http://unicode.org/faq/char_combmark.html

plus the full standard is freely available:
http://www.unicode.org/versions/Unicode11.0.0/

On Wed, Sep 26, 2018 at 12:02 PM Rich Felker <dalias at libc.org> wrote:
>
> On Wed, Sep 26, 2018 at 01:48:03PM -0500, Rob Landley wrote:
> > On 09/26/2018 10:28 AM, Rob Landley wrote:
> > > The crunch_str() logic is designed to escape nonprintable stuff and for watch.c
> > > I need to write something that measures output but lets utf8 combining stuff
> > > happen. (And measures tabs. And also parses at least the color change part of
> > > ansi escapes, but we'll burn that bridge when we come to it...)
> > >
> > > Using hexdump and echo -e's hex escapes to try to print minimal bits of the
> > > combining character examples (which cut and paste appears to have horked
> > > somewhat, but you get the idea):
> > >
> > >   $ cat tests/files/utf8/test1.txt
> > >   l̴̗̞̠ȩ̸̩̥ṱ̴͍̻ ̴̲͜ͅt̷͇̗̮h̵̥͉̝e̴̡̺̼ ̸̤̜͜ŗ̴͓͉i̶͉͓͎t̷̞̝̻u̶̻̫̗a̴̺͎̯l̴͍͜ͅ ̵̩̲̱c̷̩̟̖o̴̠͍̻m̸͚̬̘ṃ̷̢͜e̵̗͎̫n̸̨̦̖c̷̰̩͎e̴̱̞̗
> > >   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0e'
> > >   e
> > >   $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0e'
> > >   l̴̗̠e
> > >   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0ee'
> > >   ee
> > >   $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0'
> > >   l̴̗̠
> > >   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0'
> > >
> > > So there needs to be a character _before_ the combining characters for them to
> > > take effect, but they apply to the character _after_? Even when it's a newline?
> > > (Which still works as a newline, but leaves trailing weirdness?)
>
> Combining characters (at the terminal, any wcwidth==0 characters since
> there is no finer-grained distinction) attach to the
> previous/logical-left character cell.
>
> > But if I have just enough characters to fill a line, the trailing weirdness does
> > _not_ go to the next line (it appears to get discarded), at least on my 80 char
> > xfce Terminal:
> >
> > echo -e 'xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a\xcc\xb4\xcc\x97\xcc\xa0'
>
> What you should see is:
>
> xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a̴̗̠
>
> That is, the combining characters should be visible on the 'a' in the
> last cell. I would not be surprised if some terminals get this wrong.
>
> > I should look up what these escape sequences _do_. Hmmm... I could slowly and
> > painfully do that by hand, but really I want a sort of unicode version of
> > "hexdump -C" telling me what the codepoints are. (Ideally combined with a
> > variant of the "ascii" program to then tell me what each one does.) Somebody has
> > to have written this already, but I dunno what to Google for. Hmm...
> >
> > Hey Rich, I'm fiddling with unicode and lost/confused. Know any good tools for this?
>
> Does something like this help?
>
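> #define _XOPEN_SOURCE 700 /* glibc only declares wcwidth() when this is set */
> #include <stdio.h>
> #include <wchar.h>
> #include <wctype.h>
> #include <locale.h>
> int main(void)
> {
>         wint_t c;
>         /* use the environment's locale so getwchar() decodes UTF-8 input */
>         setlocale(LC_CTYPE, "");
>         /* one line per code point: its value and its terminal column width */
>         while ((c=getwchar())!=WEOF)
>                 printf("U+%.4lX wcwidth=%d\n", (unsigned long)c, wcwidth((wchar_t)c));
>         return 0;
> }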
>
> Rich


