[Toybox] Does anyone here understand how unicode combining characters work?

Rob Landley rob at landley.net
Wed Sep 26 08:28:36 PDT 2018


The crunch_str() logic is designed to escape nonprintable stuff, but for watch.c
I need to write something that measures output while letting UTF-8 combining
characters happen. (And measures tabs. And also parses at least the color change
part of ANSI escapes, but we'll burn that bridge when we come to it...)
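
As a rough illustration of that kind of measuring, here is a minimal sketch.
It is not toybox code: utf8_decode(), is_zero_width(), and measure_columns()
are hypothetical names, the zero-width test only knows about U+0300..U+036F
plus the two zero width joiner code points, tab stops are assumed to be every
8 columns, and wide (two column) characters and escape sequences are ignored.
A real version would probably lean on wcwidth() in a UTF-8 locale instead.

```c
#include <assert.h>
#include <stddef.h>

// Decode one UTF-8 sequence at s into *cp and return the bytes consumed.
// A malformed or truncated sequence is consumed as a single raw byte.
int utf8_decode(const char *s, unsigned *cp)
{
  const unsigned char *u = (const unsigned char *)s;
  int len, i;

  if (*u < 0x80) { *cp = *u; return 1; }
  if ((*u & 0xe0) == 0xc0) { len = 2; *cp = *u & 0x1f; }
  else if ((*u & 0xf0) == 0xe0) { len = 3; *cp = *u & 0x0f; }
  else if ((*u & 0xf8) == 0xf0) { len = 4; *cp = *u & 0x07; }
  else { *cp = *u; return 1; }

  for (i = 1; i < len; i++) {
    // A NUL terminator fails this check, so we never read past the string.
    if ((u[i] & 0xc0) != 0x80) { *cp = *u; return 1; }
    *cp = (*cp << 6) | (u[i] & 0x3f);
  }

  return len;
}

// ASSUMPTION: deliberately incomplete. Covers only the Combining Diacritical
// Marks block plus ZERO WIDTH NON-JOINER (U+200C) and JOINER (U+200D).
int is_zero_width(unsigned cp)
{
  return (cp >= 0x300 && cp <= 0x36f) || cp == 0x200c || cp == 0x200d;
}

// Columns used by the first line of s: tabs advance to the next multiple
// of 8, zero-width code points add nothing, everything else counts as 1.
int measure_columns(const char *s)
{
  int col = 0;
  unsigned cp;

  assert(s);
  while (*s && *s != '\n') {
    s += utf8_decode(s, &cp);
    if (cp == '\t') col = (col + 8) & ~7;
    else if (!is_zero_width(cp)) col++;
  }

  return col;
}
```

With this, measure_columns("l\xcc\xb4\xcc\x97\xcc\xa0" "e") comes out as 2.
The string literal is split before the final "e" because C would otherwise
read "\xa0e" as one oversized hex escape.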

I used hexdump and echo -e's hex escapes to try to print minimal bits of the
combining character examples (which cut and paste appears to have horked
somewhat, but you get the idea):

  $ cat tests/files/utf8/test1.txt
  l̴̗̞̠ȩ̸̩̥ṱ̴͍̻ ̴̲͜ͅt̷͇̗̮h̵̥͉̝e̴̡̺̼ ̸̤̜͜ŗ̴͓͉i̶͉͓͎t̷̞̝̻u̶̻̫̗a̴̺͎̯l̴͍͜ͅ ̵̩̲̱c̷̩̟̖o̴̠͍̻m̸͚̬̘ṃ̷̢͜e̵̗͎̫n̸̨̦̖c̷̰̩͎e̴̱̞̗
  $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0e'
  e
  $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0e'
  l̴̗̠e
  $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0ee'
  ee
  $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0'
  l̴̗̠
  $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0'

So there needs to be a character _before_ the combining characters for them to
take effect, but they apply to the character _after_? Even when it's a newline?
(Which still works as a newline, but leaves trailing weirdness?)
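
For reference, decoding the three two-byte sequences from the echo tests by
hand gives U+0334, U+0317, and U+0320, all inside the Combining Diacritical
Marks block (U+0300..U+036F). Unicode defines marks in that block as attaching
to the base character that precedes them, so a string that starts with them
has no base to attach to. What a given terminal does with such orphaned marks
is not covered here. The sketch below only detects the case, using hypothetical
helper names (utf8_pair(), is_combining(), leading_marks()) and checking that
one block only:

```c
#include <assert.h>
#include <stddef.h>

// Code point of a two-byte UTF-8 sequence (110xxxxx 10xxxxxx), or -1 if
// the two bytes do not form one.
int utf8_pair(unsigned char a, unsigned char b)
{
  if ((a & 0xe0) != 0xc0 || (b & 0xc0) != 0x80) return -1;

  return ((a & 0x1f) << 6) | (b & 0x3f);
}

// ASSUMPTION: only the Combining Diacritical Marks block is checked. Every
// code point in it encodes as exactly two bytes (0xcc 0x80 .. 0xcd 0xaf).
int is_combining(int cp)
{
  return cp >= 0x300 && cp <= 0x36f;
}

// Number of bytes of combining marks at the very start of s, i.e. marks
// with no base character in front of them. 0 means s starts normally.
size_t leading_marks(const char *s)
{
  const unsigned char *u = (const unsigned char *)s;
  size_t n = 0;

  assert(s);
  // u[n+1] is only read when u[n] is nonzero, so this stays in bounds.
  while (u[n] && is_combining(utf8_pair(u[n], u[n+1]))) n += 2;

  return n;
}
```

For the first echo test, leading_marks("\xcc\xb4\xcc\x97\xcc\xa0" "e") reports
6 orphaned bytes, while the second test (the same bytes after an "l") reports 0.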

I googled a bit and found out about "zero width joiners" and "zero width
non-joiners" and am now even more confused. (I know about the sequence that
reverses direction, and should test that my reset.c is resetting that, but I'm
willing to call that one pilot error for the moment...)

Rob

