[Toybox] Does anyone here understand how unicode combining characters work?

Rob Landley rob at landley.net
Thu Sep 27 06:53:07 PDT 2018



On 09/26/2018 05:28 PM, Rich Felker wrote:
> On Wed, Sep 26, 2018 at 03:42:25PM -0500, Rob Landley wrote:
>> I think I've got the answer to my question now. What I needed to know was how
>> much I can print before the cursor winds up on the next line (and scrolls the
>> screen if it was at the bottom), and the answer is "print combining characters
>> _after_ the last character, but stop before the next wcwidth>0 character that
>> would overflow the line".
> 
> On a decent terminal (google "magic margins"), you can always print
> the full width of the terminal, even on the last line, so if the
> terminal width is 80, you print until the wcwidth of the next
> character would throw the position strictly over 80 (81 or higher).
> 
> I'm not sure if there are still any non-magic-margin terminals that
> are relevant.

I haven't encountered any, and that's how top works. Nobody's complained yet.
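
For the archives, here's a rough sketch of that rule in Python (not the toybox
code, which is C). char_width() is a made-up stand-in for wcwidth() built on
the unicodedata module: it only knows about nonspacing/enclosing marks and East
Asian wide characters, and it doesn't handle control characters at all, so
treat its answers as approximate. fit_to_line() is likewise a hypothetical
name.

```python
import unicodedata

def char_width(ch):
    """Approximate wcwidth(): 0 for combining marks, 2 for wide, else 1.

    Assumption: this is a simplification, not a full wcwidth() table.
    """
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def fit_to_line(text, columns):
    """Return how many code points of text can be printed on one line.

    Zero-width characters after the last visible one are kept; we stop
    just before the first width>0 character that would overflow.
    """
    used = 0
    for i, ch in enumerate(text):
        w = char_width(ch)
        if w and used + w > columns:
            return i
        used += w
    return len(text)
```

With columns=2 and the input "ab" + U+0301 + "c", this keeps the first three
code points: the combining accent rides along with the "b" and the "c" is what
gets cut. A caller that doesn't trust the terminal's margin behavior could pass
columns-1 instead.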

> If so, and if you don't know what row you're on (e.g.
> for shell line editing), you probably just need to stop at 1 column
> less than the width to be safe. You could probably hardcode a list of
> $TERM values for broken terminals though.

It's not $TERM, it's the xterm consuming the output making that decision. $TERM
largely boils down to which ANSI escapes to produce behind the scenes. I don't
think your xterm can even read its child process's environment variables. (Well,
I suppose it could through /proc/$PID/environ but I'm unaware of any of them
doing it...)

The whole $TERM nonsense is legacy of physical teletype machines, then "glass
tty" terminals (VT100, TN3270, etc) that emulated them and added bespoke
per-vendor escape sequences. The IBM PC text mode swept the field (to the point
I had an Amiga terminal that emulated it for bulletin boards), but "this code
was written and works so nobody's going to throw it out" kept bad legacy
assumptions alive for decades longer than they made any sense.

>> (And tabs. And the other low-ascii stuff that's also handled inconsistently and
>> which I might have watch and less and such filter out and just not print to the
>> tty. It'd be nice if TERM=linux specified consistent behavior here, but it's
>> determined by the terminal display program consuming the output...)
> 
> I think most of this stuff is largely Unicode-agnostic, and is just a
> matter of understanding classic terminal behavior and the idioms for
> dealing with it.

The low-ascii stuff is not related to unicode, yes. But it got swept up in the
unicode changes and behavior changed when unicode support went in. And
unfortunately, terminal programs differ and the Linux ctrl-alt-f1 text mode
terminals differ from the xterms. (Haven't tried a frame buffer yet...)

For example, when I do echo -e '\x02\x02\x03\x04x' on xfce xterm, I get 4 square
boxes with digits in (i.e. uni-codepoint has no glyph, doo dah, doo dah)
followed by x. But ctrl-alt-f1 text mode prints nothing and does not advance the
cursor either, I just get the x on the first column. (I even tried "export
TERM=linux" in both and it didn't change the behavior, that's orthogonal.)

Hence filtering some of them out and not printing them if I dunno whether
they'll advance the cursor or not.
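
A minimal sketch of that kind of filter, again in Python rather than the
actual toybox C. strip_low_ascii() is a hypothetical name, and which characters
to keep is an assumption: by default only newline survives, and tabs are
dropped along with everything else below space (plus DEL).

```python
def strip_low_ascii(text, keep="\n"):
    """Drop ASCII control characters whose cursor behavior varies by terminal.

    Characters listed in `keep` are passed through unchanged. Everything
    else below 0x20, and DEL (0x7f), is removed before printing.
    """
    return "".join(
        ch for ch in text
        if ch in keep or (ch >= " " and ch != "\x7f")
    )
```

Feeding it the '\x02\x02\x03\x04x' string from the echo example leaves just
the x, which matches what the ctrl-alt-f1 console displayed. Passing
keep="\n\t" would let tabs through for a caller that expands them itself.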

> Rich

Going down ratholes most people never noticed the existence of, as usual.

(You wrote your own xterm, what does _it_ do here?)

Rob


