[Toybox] [PATCH] ls: Ensure file names are separated by 2 spaces

Sun Oct 27 14:54:14 PDT 2019

On 10/25/19 1:00 AM, Jarno Mäkipää wrote:
> Hey Andrew
> 
> Relating to the combining char "issue", did you see the patch I sent to the list?
> 
> http://lists.landley.net/pipermail/toybox-landley.net/2019-October/011076.html
> 
> I would say that the way crunch_str() in lib cuts strings with combining
> chars already works correctly, and a properly implemented terminal client
> like xterm renders them ok.

I remembered that there were issues, but I didn't remember where they were. Your
recent patch to switch stuff _back_ to crunch_str() fixed one instance of the
kind of thing I was thinking of, but I don't remember how many more there are in
the code at this point.

> And even if the user has a broken terminal, combining chars should be pushed
> after the main glyph, since there might be a use case where the user wants to
> clip his program output at 80 columns, push it to a file, and later render
> the file with a program that works, such as a web browser....

Putting combining characters _after_ the main glyph is terrible design, because
you never know you're done with a character until you've read _too_much_ and
then have to ungetc() multiple characters. (Utf-8 is great. Unicode is insane.)

> Then again there are probably lots of toys in toybox that do not use
> crunch_str and have their own logic for where to clip strings.... and they
> need lots of testing.

I try to get everything to use lib functions, but with trailing combining
characters it's unavoidable that the caller needs to be aware of this. "I sent
the 4k buffer I read, it says there are 863 columns in there, and it ended on an
even character boundary" does NOT mean you're done with that last character: you
still have to feed the start of the next 4k you read into the thing to find out.
Grrr...

crunch_str() returns width in columns and moves *str to the end of the data it
consumed, but that doesn't answer the question "are we done with the last
character, or is there more data coming that's part of _this_ character". Which
means utf8skip() can't provide a definitive answer: its return needs a
"definitely" vs "maybe" indicator that says there _could_ be more data after
this. (Or the caller needs to check that it's pointing at the NUL terminator,
except when it _isn't_: is that an incomplete character that more input may
complete into a combining character, or is it a non-combining character?)
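
To make the "definitely" vs "maybe" distinction concrete, here's a rough
standalone sketch. This is NOT the toybox code: take_columns(), decode(),
is_combining() and the glyph_end enum are all made-up names for illustration.
It also cheats by assuming every non-combining character is one column wide and
that only U+0300 through U+036F are combining (real code would ask something
like wcwidth() instead). It models chunked input, so running off the end of the
buffer always means "maybe".

```c
#include <stddef.h>

/* Hypothetical result: is the last glyph we consumed known to be finished? */
enum glyph_end { END_DEFINITE, END_MAYBE };

/* Simplifying assumption: only the Combining Diacritical Marks block. */
static int is_combining(unsigned cp)
{
  return cp >= 0x300 && cp <= 0x36f;
}

/* Decode one UTF-8 sequence from s[0..len).
 * Returns bytes used, 0 if the sequence is cut off, -1 if it is invalid. */
static int decode(const unsigned char *s, size_t len, unsigned *cp)
{
  size_t i, n;

  if (!len) return 0;
  if (*s < 0x80) { *cp = *s; return 1; }
  else if ((*s & 0xe0) == 0xc0) { n = 2; *cp = *s & 0x1f; }
  else if ((*s & 0xf0) == 0xe0) { n = 3; *cp = *s & 0x0f; }
  else if ((*s & 0xf8) == 0xf0) { n = 4; *cp = *s & 0x07; }
  else return -1;

  for (i = 1; i < n; i++) {
    if (i >= len) return 0;                 /* ran out of input mid-sequence */
    if ((s[i] & 0xc0) != 0x80) return -1;   /* not a continuation byte */
    *cp = (*cp << 6) | (s[i] & 0x3f);
  }

  return (int)n;
}

/* Consume at most `width` columns from buf[0..len), including any combining
 * characters trailing the last glyph. Stores bytes consumed in *used and
 * columns consumed in *cols.
 *
 * END_DEFINITE: we saw the next non-combining character, so the last glyph
 *               consumed is complete.
 * END_MAYBE:    we ran out of data (possibly mid-sequence), so the next chunk
 *               could still start with a combining character for it. */
enum glyph_end take_columns(const char *buf, size_t len, int width,
                            size_t *used, int *cols)
{
  const unsigned char *s = (const unsigned char *)buf;
  size_t pos = 0;
  unsigned cp;
  int n;

  *cols = 0;
  while (pos < len) {
    n = decode(s + pos, len - pos, &cp);
    if (!n) break;                          /* truncated: need more input */
    if (n < 0) { n = 1; cp = '?'; }         /* invalid byte: count 1 column */
    if (!is_combining(cp)) {
      if (*cols == width) {
        *used = pos;
        return END_DEFINITE;
      }
      ++*cols;
    }
    pos += n;
  }
  *used = pos;

  return END_MAYBE;
}
```

With "e", U+0301, "x" in one buffer and width 1, this consumes 3 bytes and can
say END_DEFINITE because it peeked at the "x". Hand it just "e" (or "e" plus the
first byte of U+0301) and all it can say is END_MAYBE, even though the buffer
ended on a perfectly good character boundary. At real EOF the caller can treat
END_MAYBE as done.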

Anyway, now I understand why there's so much unicode software out there that
doesn't get this stuff right. Unicode is badly designed.

Rob