[Toybox] Does anyone here understand how unicode combining characters work?

Rich Felker dalias at libc.org
Wed Sep 26 12:01:58 PDT 2018


On Wed, Sep 26, 2018 at 01:48:03PM -0500, Rob Landley wrote:
> On 09/26/2018 10:28 AM, Rob Landley wrote:
> > The crunch_str() logic is designed to escape nonprintable stuff and for watch.c
> > I need to write something that measures output but lets utf8 combining stuff
> > happen. (And measures tabs. And also parses at least the color change part of
> > ansi escapes, but we'll burn that bridge when we come to it...)
> > 
> > Using hexdump and echo -e's hex escapes to try to print minimal bits of the
> > combining character examples (which cut and paste appears to have horked
> > somewhat, but you get the idea):
> >
> >   $ cat tests/files/utf8/test1.txt
> >   l̴̗̞̠ȩ̸̩̥ṱ̴͍̻ ̴̲͜ͅt̷͇̗̮h̵̥͉̝e̴̡̺̼ ̸̤̜͜ŗ̴͓͉i̶͉͓͎t̷̞̝̻u̶̻̫̗a̴̺͎̯l̴͍͜ͅ ̵̩̲̱c̷̩̟̖o̴̠͍̻m̸͚̬̘ṃ̷̢͜e̵̗͎̫n̸̨̦̖c̷̰̩͎e̴̱̞̗
> >   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0e'
> >   e
> >   $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0e'
> >   l̴̗̠e
> >   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0ee'
> >   ee
> >   $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0'
> >   l̴̗̠
> >   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0'
> > 
> > So there needs to be a character _before_ the combining characters for them to
> > take effect, but they apply to the character _after_? Even when it's a newline?
> > (Which still works as a newline, but leaves trailing weirdness?)

Combining characters (at the terminal, any character with wcwidth==0,
since there is no finer-grained distinction) attach to the
previous/logical-left character cell, not to the character that follows.
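
For the measuring problem in watch.c, that rule means width-0 characters simply add no columns. A rough sketch of that idea, not toybox code: str_columns() is a made-up name, it assumes the caller has already called setlocale(LC_CTYPE, "") in a UTF-8 locale, and skipping nonprintable characters (wcwidth < 0) is my own choice here, not something crunch_str() does.

```c
#include <locale.h>	/* setlocale(), which the caller must invoke first */
#include <string.h>
#include <wchar.h>

/* XSI function; glibc hides the prototype unless _XOPEN_SOURCE is set,
 * so declare it explicitly. */
int wcwidth(wchar_t wc);

/* Hypothetical helper: count the terminal columns a multibyte string
 * occupies in the current locale. Width-0 characters add no columns
 * because they share the cell of the preceding character.
 * Returns -1 on an invalid or truncated multibyte sequence. */
int str_columns(const char *s)
{
	mbstate_t st;
	size_t len = strlen(s), n;
	wchar_t wc;
	int cols = 0, w;

	memset(&st, 0, sizeof st);
	while (len) {
		n = mbrtowc(&wc, s, len, &st);
		if (n == (size_t)-1 || n == (size_t)-2) return -1;
		if (!n) break;		/* embedded NUL; strlen() rules this out */
		w = wcwidth(wc);
		if (w > 0) cols += w;	/* w==0: combining; w<0: skipped (assumption) */
		s += n;
		len -= n;
	}
	return cols;
}
```

In a UTF-8 locale, the "l" plus three combining marks plus "e" string from the echo tests above should measure as 2 columns. When writing that string as a C literal, split it before the final "e" ("...\xcc\xa0" "e"), otherwise the compiler reads \xa0e as one hex escape.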

> But if I have just enough characters to fill a line, the trailing weirdness does
> _not_ go to the next line (it appears to get discarded), at least on my 80 char
> xfce Terminal:
> 
> echo -e 'xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a\xcc\xb4\xcc\x97\xcc\xa0'

What you should see is:

xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a̴̗̠

That is, the combining characters should be visible on the 'a' in the
last cell. I would not be surprised if some terminals get this wrong.

> I should look up what these escape sequences _do_. Hmmm... I could slowly and
> painfully do that by hand, but really I want a sort of unicode version of
> "hexdump -C" telling me what the codepoints are. (Ideally combined with a
> variant of the "ascii" program to then tell me what each one does.) Somebody has
> to have written this already, but I dunno what to Google for. Hmm...
> 
> Hey Rich, I'm fiddling with unicode and lost/confused. Know any good tools for this?

Does something like this help?

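#define _XOPEN_SOURCE 700	/* glibc needs this to declare wcwidth() */
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
	wint_t c;
	/* Decode stdin according to the environment's locale (e.g. UTF-8). */
	setlocale(LC_CTYPE, "");
	/* Print each codepoint and the column width the locale assigns it. */
	while ((c=getwchar())!=WEOF)
		printf("U+%.4lX wcwidth=%d\n", (unsigned long)c, wcwidth(c));
	return 0;
}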
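
If you would rather not depend on a UTF-8 locale being installed, the codepoints can also be pulled out of the raw bytes by hand. A minimal sketch under that assumption; utf8_decode() is a hypothetical helper name, not something from libc or toybox, and it only decodes, it says nothing about width.

```c
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (len bytes
 * available) into *cp. Returns the number of bytes consumed, or 0 if
 * the sequence is invalid or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
	/* Smallest codepoint allowed for each sequence length. */
	static const unsigned long min[] = {0, 0, 0x80, 0x800, 0x10000};
	size_t n, i;
	unsigned long c;

	if (!len) return 0;
	if (s[0] < 0x80) { *cp = s[0]; return 1; }
	else if ((s[0] & 0xe0) == 0xc0) { n = 2; c = s[0] & 0x1f; }
	else if ((s[0] & 0xf0) == 0xe0) { n = 3; c = s[0] & 0x0f; }
	else if ((s[0] & 0xf8) == 0xf0) { n = 4; c = s[0] & 0x07; }
	else return 0;		/* stray continuation or invalid lead byte */
	if (len < n) return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xc0) != 0x80) return 0;
		c = (c << 6) | (s[i] & 0x3f);
	}
	/* Reject overlong forms, surrogates, and values past U+10FFFF. */
	if (c < min[n] || (c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff) return 0;
	*cp = c;
	return n;
}
```

Fed the bytes from the echo tests, \xcc\xb4, \xcc\x97 and \xcc\xa0 decode to U+0334, U+0317 and U+0320. Calling it in a loop and advancing by the return value gives a crude codepoint dump.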

Rich


