[Toybox] [PATCH] tr: fix pathological flushing.

Tue Jan 5 23:05:57 PST 2021

On 1/5/21 7:17 PM, enh wrote:
> On Sat, Dec 26, 2020 at 11:09 AM Rob Landley <rob at landley.net> wrote:
> 
>     On 12/5/20 1:32 PM, enh wrote:
>     > (you'd expect to see '東해 물과 백두산이' if this actually worked: change from
>     > the hangeul for "east" to the hanja for "east" but leave everything else
>     > alone.)
>     >
>     > xxd confirms it's just screwing up bytes:
> 
>     Oh I know the others are all doing bytes, but that doesn't make it the right
>     thing to do. :)
> 
>     > ~$ echo '동해 물과 백두산이' |  xxd
>     > 00000000: eb8f 99ed 95b4 20eb acbc eab3 bc20 ebb0  ...... ...... ..
>     > 00000010: b1eb 9190 ec82 b0ec 9db4 0a              ...........
>     > ~$ echo '동해 물과 백두산이' | tr '동' '東' | xxd
>     > 00000000: e69d b1ed 95b4 20e6 acbc eab3 bc20 e6b0  ...... ...... ..
>     > 00000010: b1e6 9190 ec82 b0ec 9db4 0a              ...........
>     >
>     > 동 is 0xeb 0x8f 0x99 and the equivalent Chinese character 東 is 0xe6 0x9d
>     > 0xb1,
>     > and you can see from those hex dumps that what tr did was replace 0xeb with
>     > 0xe6, 0x8f with 0x9d, and 0x99 with 0xb1. this did the right thing by accident
>     > for 동 but mangled other characters that contained any of those bytes.
>     >
>     > and although philosophically i'm usually on board with your "all times are
>     > ISO, all text is UTF8", i'm really not sure it makes much sense to even
>     > *try* to support this in tr. why? because i think it opens the i18n/l10n
>     > can of worms
>     > again. if you think about non-binary uses of tr, they're often stuff like
>     > "convert to all caps", but are we going to get that right for Turkish/Azeri
>     > dotted/dotless 'i's, Greek final/non-final sigma, etc? are we going to
>     > have tr's behavior then depend on your locale?
> 
>     I've been handing that off to libc, which I'm aware bionic does not
>     currently do,
> 
> to be clear, *no* C library can get all of those examples right because the C
> APIs are broken as designed. you genuinely need something like icu4c for this.
> that's *why* bionic only supports the C locale.

Sadly, this is one of the things I've wanted to "study until it makes sense and
I can understand what the right thing to do is", and it's one where staring at
it has made it worse.

Yeah, combining characters are out of scope for tr, but I thought "map one Unicode
code point to another" was a reasonable goal for tr? You're making a
fairly strong case that it isn't. (I knew the unicode consortium was insane, but
I'd assumed a base layer of competency that had gotten crapped upon by microsoft
and IBM joining the committee, which I might be able to dig down to reach.
Sounds like it's turtles all the way...)
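
For illustration, the byte-level mangling visible in the quoted xxd dump can be
reproduced with a minimal sketch of a byte-wise translation table. This is not
toybox's actual tr code; tr_bytes is a hypothetical name, and the sketch assumes
equal-length from/to byte lists:

```c
#include <stddef.h>

// Hypothetical helper (not toybox code): translate one byte at a time through
// a 256-entry table, the way a bytes-only tr behaves.
void tr_bytes(unsigned char *buf, size_t len, const unsigned char *from,
              const unsigned char *to, size_t maplen)
{
  unsigned char map[256];
  size_t i;

  // Start with the identity mapping, then overlay the requested pairs.
  for (i = 0; i < 256; i++) map[i] = (unsigned char)i;
  for (i = 0; i < maplen; i++) map[from[i]] = to[i];

  // Each byte is translated on its own, with no notion of UTF-8 sequences.
  for (i = 0; i < len; i++) buf[i] = map[buf[i]];
}
```

Fed the three bytes of 동 as "from" and the three bytes of 東 as "to", this
turns every 0xeb into 0xe6 wherever it occurs, so 물 (0xeb 0xac 0xbc) comes
out as 0xe6 0xac 0xbc even though nobody asked for that.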

> but, yeah, in retrospect i shouldn't have led with an example that libc *can*
> get right and followed up with one it can't. the other way round would
> definitely have been better :-)
> 
> but i can explain why i did that: we've had a ton of trouble with the
> dotted/dotless 'i' in Java because you really have to think about whether you
> want that behavior. and it's often quite non-obvious because it doesn't just
> depend on your locale, it depends on your *input*. if i'm a Turk changing the
> case of some Turkish text, yes, i want that locale-specific behavior. if i'm a
> Turk who's just trying to build AOSP, i don't, because it's going to corrupt ASCII.

Sigh. I think I need to bow to your expertise here. You at least have real-world
experience with the mess via the bionic and dalvik library bug reports.

Is this _just_ a problem with case mapping? (Can we say [:lower:] and [:upper:]
are the relevant 26 ASCII characters and _otherwise_ accept a Unicode code point
input at each end as one character?)
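
A minimal sketch of what "only the 26 ASCII letters" could look like, assuming
UTF-8 input. The name ascii_upper_str is hypothetical, not an existing toybox
function:

```c
// Hypothetical helper: uppercase only 'a'..'z', independent of locale.
// Every byte of a multibyte UTF-8 sequence is >= 0x80, so non-ASCII
// characters (including dotless i, U+0131) pass through untouched.
void ascii_upper_str(char *s)
{
  for (; *s; s++)
    if (*s >= 'a' && *s <= 'z') *s = (char)(*s - 'a' + 'A');
}
```

With this rule 'i' always becomes 'I', which is the behavior wanted when
building AOSP and the wrong one for Turkish text, so it picks a side rather
than solving the problem.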

> my claim here is that "most Turks^wpeople using tr(1) are programmers trying to
> build something" :-)
>  
>     you changed the main.c intro code to not pass along environmental utf8-ness,
>     I changed it back and have another pending commit from you (which still isn't
>     going to make a difference on bionic yet anyway... :)
> 
> that's unrelated. that's some weird macOS-only thing about what they call
> their locales.

Your updated one is applied.

>     > are we going to deal with combining
>     > characters too, or do i have to specify all the ways you can write "ö" to get
>     > "Freude, schöner Götterfunken" right (because without a hex dump, neither you
>     > nor i know whether those two 'ö's were encoded the same way)?
> 
>     No, I was just going to substitute individual utf8 sequences. There comes a
>     point where you switch to "sed"...
> 
> so what's the point? you can't get this right, and no-one cares anyway, but it
> makes things slower and more complicated because... ? who are you, and what have
> you done to the real rob? :-)

Buried in shell logic, giving everything else combined like 1/4 voltage. (And
shell has to care about this crap too because bash has ${PATH^^*} and friends.
Wildcard matching has [:upper:] and variables have declare -u.)

Did you know you can have exported local variables? I'm not sure who to argue
with about that, but I expect it would require time travel to the mid-1980's. (I
figured out how to make it work, I just object to NEEDING to at a conceptual level.)

>     > amusingly, the Plan9 man page only gave one example of using tr(1), and it was
>     > converting ASCII upper/lower. so i don't think they had any _use_ for it
>     > either, they just wrote everything in terms of runes.
> 
>     They were english-only white guys in the early 90's. Good intentions only go so
>     far if it never leaves the lab and hits real world data.
> 
> ...which is why the C API is broken for this, and -- unless you're going to
> accept icu4c as a dependency -- these two well-intentioned white guys in the
> early 2020's aren't going to fix it either.
> 
> (personally i'd be fine with the icu4c dependency, though i still don't think
> it's *useful* for tr(1).)

If bash didn't have case mapping I'd say this whole can of worms is out of
scope, but unfortunately bash has several instances of case mapping.

You're right that mistranslating upper/lower case is a failure, but not being
able to tr hiragana to katakana seems like a failure too? (But then it still
can't handle romaji because that's a 2 char to 1 char mapping. Hmmm...)

I'm trying to figure out where the 80/20 line is. It's not hard to parse unicode
points, and for 1/1 mapping lists it's reasonably straightforward what to do
with them. It's this [:banana:] nonsense where it wants to map between leitmotif
and color palette that's confusing: the relationship is not FIXED. (And why does
it even HAVE [:print:]?)

Part of the problem is I'm not a heavy tr user. It always struck me as one of
those obsolete tools like "ed" left over from the daisy-wheel typesetting days
("man" is built on a stack of like 6 of them), and now that I'm trying to figure
out what it's FOR... I find understanding this tool's use case profile highly
non-obvious.

>     Doesn't look too expensive? In musl time() is a clock_gettime(CLOCK_REALTIME)
>     wrapper which lives in the vdso. In bionic you're doing
>     reinterpret_cast<decltype(&time)>(__libc_globals->vdso[VDSO_TIME].fn); before
>     making a call to gettimeofday() which ALSO lives in the vdso...?
> 
> the gettimeofday() is the fallback for kernels whose VDSO doesn't support it.
> (things are especially complicated if you're a 32-bit process on aarch64, for
> example. it's all fixed as of 5.x but we have something like a 7 year rule too.)

I.E. on older systems it would work but be slow. Which as failure modes go...

My point is that performance optimization is a can of worms that would take my
full attention to do right, and I'm not there yet. In the meantime partial
optimizations that change the output of watch and tee and such rub my nose in
the incompleteness of my test suite which can't really see terminal output yet
because I haven't done proper pty emulators. But then I haven't taught "watch"
to parse ANSI color change sequences and jumps yet (and I first wrote code to
parse the full DOS ansi.sys sequence set and write directly into VGA text mode
memory at 0xb8000 or whatever it was in 1991).

>     My original https://landley.net/toybox/design.html#goals ordering of "simple,
>     small, fast, and full-featured" has shuffled around a bit already. Several
>     commands have gotten a LOT more full featured, and CONFIG_SORT_SMALL went away a
>     while ago. Needs of the users and all that. I still want simple first, but
>     "simplest implementation of..." has "of" doing some heavy lifting these days.
> 
> doing tr(1) *right* (while a laudable goal) is so hard (or at least "pulls in a
> very large dependency") that i don't think it makes sense until/unless someone
> actually needs it. and the fact that no other tr(1) does it right suggests
> no-one does.

That's hard to argue against.

I'm still not quite sure "handles basic 1-1 unicode point mapping, doesn't care
what each unicode point IS so treats combining characters as a series of unicode
points, none of the [:splat:] macros expand to unicode characters" isn't a
reasonable place to draw the line. But then I don't really know what combining
characters DO and why they exist. "This can't work for everything, can it work
for enough to be useful"... sigh. I dunno.
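
The "basic 1-1 mapping that doesn't care what each code point IS" option could
be sketched roughly as follows. This is an illustration under stated
assumptions, not toybox's implementation: utf8_decode, utf8_encode and
tr_codepoints are hypothetical names, the decoder does not reject overlong
encodings or surrogates, and the lookup is a plain linear search:

```c
#include <stddef.h>
#include <string.h>

// Decode one UTF-8 sequence from s (len bytes available) into *cp.
// Returns bytes consumed, or 0 if the sequence is invalid or truncated.
size_t utf8_decode(const unsigned char *s, size_t len, unsigned *cp)
{
  size_t n, i;

  if (!len) return 0;
  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  if ((s[0] & 0xe0) == 0xc0) { n = 2; *cp = s[0] & 0x1f; }
  else if ((s[0] & 0xf0) == 0xe0) { n = 3; *cp = s[0] & 0x0f; }
  else if ((s[0] & 0xf8) == 0xf0) { n = 4; *cp = s[0] & 0x07; }
  else return 0;
  if (len < n) return 0;
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xc0) != 0x80) return 0;
    *cp = (*cp << 6) | (s[i] & 0x3f);
  }
  return n;
}

// Encode cp as UTF-8 into out (room for 4 bytes); returns bytes written.
size_t utf8_encode(unsigned cp, unsigned char *out)
{
  if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xc0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3f));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xe0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (cp & 0x3f));
    return 3;
  }
  out[0] = (unsigned char)(0xf0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
  out[3] = (unsigned char)(0x80 | (cp & 0x3f));
  return 4;
}

// Map from[j] to to[j] one code point at a time. Unmapped code points and
// invalid bytes are copied through verbatim. out needs room for 4*len bytes.
// Returns the number of bytes written.
size_t tr_codepoints(const unsigned char *in, size_t len, unsigned char *out,
                     const unsigned *from, const unsigned *to, size_t maplen)
{
  size_t i = 0, o = 0, n, j;
  unsigned cp;

  while (i < len) {
    n = utf8_decode(in + i, len - i, &cp);
    if (!n) { out[o++] = in[i++]; continue; }
    for (j = 0; j < maplen; j++) if (cp == from[j]) break;
    if (j < maplen) o += utf8_encode(to[j], out + o);
    else { memcpy(out + o, in + i, n); o += n; }
    i += n;
  }
  return o;
}
```

Mapping U+B3D9 (동) to U+6771 (東) this way leaves 물 alone, unlike the
byte-wise table. A combining sequence is still just several independent code
points to this loop, which is exactly the limitation described above.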

That's why it's still in pending. "Only does ascii" wasn't so much a choice as a
historical accident held in place by inertia. Accepting that as "what tr should
do" seems sad, but I'm not qualified to improve upon it.

> (yes, i know that kind of logic can be easily abused, but i think
> it's reasonable here --- it's not like GNU is averse to adding
> features/reimplementing libc within their own tools, and they do support Unicode
> just about everywhere else.) and even if you don't buy that, remember that
> "fixed" is complicated here (well, in Turkey and Azerbaijan anyway).

If you want to explicitly list the set of lowercase characters that turn into
each corresponding uppercase character, with your entire alphabet spelled out in
the from and to sections of tr, it SEEMS like tr should be able to handle that?

There should _be_ a command line utility that can do that. Is it lex, maybe?

Anyway, not gonna get to promoting tr this week. I have opened that can of worms
that is shell functions, and am scheduled to fly back to Tokyo on the 20th
(covid willing) and should probably cut a toybox release before getting on the
plane. I've got a bug heap to fix before then with date -I and so on in it...

>     >     Rob
> 
>     Rob

Rob