[Toybox] Does anyone here understand how unicode combining characters work?

Rich Felker dalias at libc.org
Wed Sep 26 15:28:30 PDT 2018


On Wed, Sep 26, 2018 at 03:42:25PM -0500, Rob Landley wrote:
> On 09/26/2018 03:00 PM, enh wrote:
> > if anyone's interested, here's how bionic translates from the actual
> > unicode properties to implement wcwidth:
> > https://android.googlesource.com/platform/bionic/+/master/libc/bionic/wcwidth.cpp
> > 
> > (we do this in general so that we can outsource all the actual
> > unicode data to icu4c, and thereby guarantee consistency for
> > C/C++/Java regardless of which API is actually called.)
> 
> I think I've got the answer to my question now. What I needed to know was how
> much I can print before the cursor winds up on the next line (and scrolls the
> screen if it was at the bottom), and the answer is "print combining characters
> _after_ the last character, but stop before the next wcwidth>0 character that
> would overflow the line".

On a decent terminal (google "magic margins"), you can always print
the full width of the terminal, even on the last line, so if the
terminal width is 80, you print until the wcwidth of the next
character would throw the position strictly over 80 (81 or higher).
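
A rough C sketch of that rule, for illustration only. fit_prefix and
toy_width are hypothetical names, not toybox or libc functions.
toy_width is a deliberately tiny stand-in so the example does not
depend on the locale; real code would presumably pass wcwidth() from
<wchar.h> (after setlocale()) instead.

```c
#include <stddef.h>

/* Hypothetical toy width function, NOT a replacement for wcwidth():
 * it only knows two ranges, enough to exercise the logic below. */
static int toy_width(wchar_t c)
{
    if (c >= 0x0300 && c <= 0x036F) return 0; /* combining diacritical marks */
    if (c >= 0x4E00 && c <= 0x9FFF) return 2; /* CJK unified ideographs */
    return 1;                                 /* everything else: one column */
}

/* Return how many wide characters of s can be printed on one line of
 * `cols` columns, starting from column 0, on a magic-margin terminal.
 * Zero-width characters that follow the last fitting character are
 * included; we stop before the first character whose width would push
 * the position strictly past cols. */
static size_t fit_prefix(const wchar_t *s, int cols, int (*width)(wchar_t))
{
    int used = 0;
    size_t i;

    for (i = 0; s[i]; i++) {
        int w = width(s[i]);

        /* Assumption of this sketch: a negative width (wcwidth()'s
         * answer for nonprintable characters) is counted as zero. */
        if (w < 0) w = 0;
        if (used + w > cols) break;
        used += w;
    }
    return i;
}
```

With cols = 4, the string L"abce\u0301f" gives 5: four columns of
base characters plus the combining acute accent, stopping before "f".
A double-width character that would straddle the margin is left for
the next line, so L"abc\u4e00" gives 3.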

I'm not sure if there are still any non-magic-margin terminals that
are relevant. If so, and if you don't know what row you're on (e.g.
for shell line editing), you probably just need to stop at 1 column
less than the width to be safe. You could probably hardcode a list of
$TERM values for broken terminals though.
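
A minimal sketch of the hardcoded-list idea, in C. The function name
usable_columns and the entries in the list are placeholders invented
for this example; I am not claiming any particular real $TERM value
belongs on it.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder entries only: these are made-up names, not real
 * terminal types known to lack magic margins. */
static const char *const no_magic_margin_terms[] = {
    "example-nomargin",
    "example-oldterm",
};

/* Return how many columns are safe to fill on a line: the full width
 * normally, one less if term (the value of $TERM, may be NULL) is on
 * the list above. */
static int usable_columns(int cols, const char *term)
{
    size_t i;
    size_t n = sizeof(no_magic_margin_terms) / sizeof(no_magic_margin_terms[0]);

    if (term) {
        for (i = 0; i < n; i++) {
            if (!strcmp(term, no_magic_margin_terms[i]))
                return cols > 0 ? cols - 1 : 0;
        }
    }
    return cols;
}
```

A caller would pass getenv("TERM") and the width it got from the tty,
then use the result as the column limit when deciding where to stop
printing.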

> (This is the logic I've needed to work out for screen, less, and vi as well. At
> least when they're not doing the force escapes thing.)
> 
> The ansi escape parsing is still a todo item, but I note I wrote my own ansi
> escape parsing direct screen memory writer for DOS as one of my first C programs
> back in 1990. :P
> 
> (And tabs. And the other low-ascii stuff that's also handled inconsistently and
> which I might have watch and less and such filter out and just not print to the
> tty. It'd be nice if TERM=linux specified consistent behavior here, but it's
> determined by the terminal display program consuming the output...)

I think most of this stuff is largely Unicode-agnostic, and is just a
matter of understanding classic terminal behavior and the idioms for
dealing with it.

Rich


