[Toybox] Does anyone here understand how unicode combining characters work?

Wed Sep 26 13:59:16 PDT 2018

On 09/26/2018 02:01 PM, Rich Felker wrote:
> On Wed, Sep 26, 2018 at 01:48:03PM -0500, Rob Landley wrote:
>> On 09/26/2018 10:28 AM, Rob Landley wrote:
>>> The crunch_str() logic is designed to escape nonprintable stuff and for watch.c
>>> I need to write something that measures output but lets utf8 combining stuff
>>> happen. (And measures tabs. And also parses at least the color change part of
>>> ansi escapes, but we'll burn that bridge when we come to it...)
>>>
>>> Using hexdump and echo -e's hex escapes to try to print minimal bits of the
>>> combining character examples (which cut and paste appears to have horked
>>> somewhat, but you get the idea):
>>>
>>>   $ cat tests/files/utf8/test1.txt
>>>   l̴̗̞̠ȩ̸̩̥ṱ̴͍̻ ̴̲͜ͅt̷͇̗̮h̵̥͉̝e̴̡̺̼ ̸̤̜͜ŗ̴͓͉i̶͉͓͎t̷̞̝̻u̶̻̫̗a̴̺͎̯l̴͍͜ͅ ̵̩̲̱c̷̩̟̖o̴̠͍̻m̸͚̬̘ṃ̷̢͜e̵̗͎̫n̸̨̦̖c̷̰̩͎e̴̱̞̗
>>>   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0e'
>>>   e
>>>   $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0e'
>>>   l̴̗̠e
>>>   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0ee'
>>>   ee
>>>   $ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0'
>>>   l̴̗̠
>>>   $ echo -e '\xcc\xb4\xcc\x97\xcc\xa0'
>>>
>>> So there needs to be a character _before_ the combining characters for them to
>>> take effect, but they apply to the character _after_? Even when it's a newline?
>>> (Which still works as a newline, but leaves trailing weirdness?)
> 
> Combining characters (at the terminal, any wcwidth==0 characters since
> there is no finer-grained distinction) attach to the
> previous/logical-left character cell.
> 
>> But if I have just enough characters to fill a line, the trailing weirdness does
>> _not_ go to the next line (it appears to get discarded), at least on my 80 char
>> xfce Terminal:
>>
>> echo -e 'xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a\xcc\xb4\xcc\x97\xcc\xa0'
> 
> What you should see is:
> 
> xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a̴̗̠
> 
> That is, the combining characters should be visible on the 'a' in the
> last cell. I would not be surprised if some terminals get this wrong.

The xfce terminal shows all the data on the character to the right.

Thunderbird sticks ~ characters in between stuff, but shows the sub-whatsis
(cedilla?) under the character to the right.

I pulled up the web archive in chrome on a windows box at work and it's... sort
of doing both? The second example in the list ("le") is showing the
under-apostrophe under the l but has some sort of overstrike through the E, and
the next to last one has l with an under-apostrophe but then a tilde after it.

Ahem: Wheee.
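
For the "measure output but let combining stuff happen" part, here is a rough
sketch of the rule Rich describes: zero-width characters attach to the previous
cell, so they add no columns. To stay independent of the locale it decodes UTF-8
by hand, and is_zero_width() is a hypothetical stand-in for wcwidth(c)==0 that
only knows the U+0300..U+036F block used in the examples above. It ignores tabs,
double-width characters and ansi escapes, and the function names are made up for
illustration, not taken from toybox.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence at s into *cp; return bytes consumed (1-4).
 * Simplified: a malformed sequence is consumed one byte at a time as U+FFFD,
 * and overlong encodings are not rejected. */
static int utf8_decode(const unsigned char *s, unsigned *cp)
{
	int len, i;

	if (s[0] < 0x80) { *cp = s[0]; return 1; }
	else if ((s[0] & 0xe0) == 0xc0) { *cp = s[0] & 0x1f; len = 2; }
	else if ((s[0] & 0xf0) == 0xe0) { *cp = s[0] & 0x0f; len = 3; }
	else if ((s[0] & 0xf8) == 0xf0) { *cp = s[0] & 0x07; len = 4; }
	else { *cp = 0xfffd; return 1; }

	/* A NUL terminator fails the continuation check, so we never run past it. */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80) { *cp = 0xfffd; return 1; }
		*cp = (*cp << 6) | (s[i] & 0x3f);
	}
	return len;
}

/* Assumption: stand-in for wcwidth(c)==0, covering only the Combining
 * Diacritical Marks block (U+0300..U+036F). */
static int is_zero_width(unsigned cp)
{
	return cp >= 0x300 && cp <= 0x36f;
}

/* Count columns: zero-width code points add nothing, everything else adds 1. */
int count_columns(const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	unsigned cp;
	int cols = 0;

	while (*s) {
		s += utf8_decode(s, &cp);
		if (!is_zero_width(cp)) cols++;
	}
	return cols;
}
```

With this, the bytes from the second echo example (l, three combining marks, e)
measure as two columns, and the combining marks on their own measure as zero.
When writing such strings as C literals, split the literal before the trailing
"e" so it is not swallowed by the preceding \x escape.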

>> Hey Rich, I'm fiddling with unicode and lost/confused. Know any good tools for this?
> 
> Does something like this help?
> 
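> #define _XOPEN_SOURCE 700 /* so glibc's <wchar.h> declares wcwidth() */
> #include <stdio.h>
> #include <wchar.h>
> #include <wctype.h>
> #include <locale.h>
> int main(void)
> {
> 	wint_t c;
> 
> 	/* Use the environment's locale so getwchar() decodes multibyte input. */
> 	setlocale(LC_CTYPE, "");
> 	/* Print each code point along with its terminal column width. */
> 	while ((c=getwchar())!=WEOF)
> 		printf("U+%.4X wcwidth=%d\n", (unsigned)c, wcwidth(c));
> 	return 0;
> }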
> 
> Rich

It's a start, thanks.

I was hoping there was an existing thing, but I can probably just stick that in
the toys/example directory next to the utf8 range tester.

Still need to google the U+ to see what that character does, which is awkward
because my phone tethering can get really intermittent downtown inside tall
brick buildings. Whether a web page decides to load in any given minute is
potluck: there's some sort of signal-nosignal-signal interference pattern on my
desk at about 2 inch intervals that moves over the course of the day as the sun
changes position. Finding where there's signal now can take a bit because the
baseband processor gets confused and _thinks_ it has signal when it doesn't,
then takes a while to resync with the tower. Sometimes I have to reboot it...

Hmmm, it seems I need to make it parse
https://unicode.org/Public/11.0.0/ucd/UnicodeData.txt ...
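
A minimal sketch of what parsing that file could look like, assuming the
documented layout of semicolon-separated fields where field 0 is the code point
in hex, field 1 the character name, field 2 the general category and field 3 the
canonical combining class. The struct and function names here are hypothetical,
and only the first four fields are read.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the first four UnicodeData.txt fields. */
struct ucd_entry {
	unsigned cp;           /* field 0: code point */
	char name[96];         /* field 1: character name (truncated if longer) */
	char category[4];      /* field 2: general category, e.g. "Mn" */
	int combining_class;   /* field 3: canonical combining class */
};

/* Parse one line of UnicodeData.txt. Returns 0 on success, -1 if malformed. */
int parse_ucd_line(const char *line, struct ucd_entry *e)
{
	const char *name, *cat, *ccc;
	char *end;
	size_t n;

	/* Field 0: hex code point, terminated by ';'. */
	e->cp = (unsigned)strtoul(line, &end, 16);
	if (end == line || *end != ';') return -1;

	/* Field 1: name. */
	name = end + 1;
	if (!(cat = strchr(name, ';'))) return -1;
	n = (size_t)(cat - name);
	if (n >= sizeof(e->name)) n = sizeof(e->name) - 1;
	memcpy(e->name, name, n);
	e->name[n] = 0;

	/* Field 2: general category. */
	cat++;
	if (!(ccc = strchr(cat, ';'))) return -1;
	n = (size_t)(ccc - cat);
	if (n >= sizeof(e->category)) return -1;
	memcpy(e->category, cat, n);
	e->category[n] = 0;

	/* Field 3: combining class; atoi() stops at the next ';'. */
	e->combining_class = atoi(ccc + 1);
	return 0;
}

/* General categories Mn, Mc and Me are the "Mark" categories. */
int is_mark(const struct ucd_entry *e)
{
	return e->category[0] == 'M';
}
```

Fed a line shaped like "0334;COMBINING TILDE OVERLAY;Mn;1;NSM;;;;;N;...", this
yields the code point, the name and the "Mn" category, which would replace the
googling step for the U+ values the wcwidth tool prints.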

Rob