[Toybox] [PATCH] tr: fix pathological flushing.

enh enh at google.com
Tue Jan 12 17:41:13 PST 2021


On Tue, Jan 5, 2021 at 10:53 PM Rob Landley <rob at landley.net> wrote:

> On 1/5/21 7:17 PM, enh wrote:
> > On Sat, Dec 26, 2020 at 11:09 AM Rob Landley <rob at landley.net> wrote:
> >
> >     On 12/5/20 1:32 PM, enh wrote:
> >     > (you'd expect to see '東해 물과 백두산이' if this actually worked: change from
> >     > the hangeul for "east" to the hanja for "east" but leave everything else
> >     > alone.)
> >     >
> >     > xxd confirms it's just screwing up bytes:
> >
> >     Oh I know the others are all doing bytes, but that doesn't make it the right
> >     thing to do. :)
> >
> >     > ~$ echo '동해 물과 백두산이' |  xxd
> >     > 00000000: eb8f 99ed 95b4 20eb acbc eab3 bc20 ebb0  ...... ...... ..
> >     > 00000010: b1eb 9190 ec82 b0ec 9db4 0a              ...........
> >     > ~$ echo '동해 물과 백두산이' | tr '동' '東' | xxd
> >     > 00000000: e69d b1ed 95b4 20e6 acbc eab3 bc20 e6b0  ...... ...... ..
> >     > 00000010: b1e6 9190 ec82 b0ec 9db4 0a              ...........
> >     >
> >     > 동 is 0xeb 0x8f 0x99 and the equivalent Chinese character 東 is 0xe6 0x9d 0xb1,
> >     > and you can see from those hex dumps that what tr did was replace 0xeb with
> >     > 0xe6, 0x8f with 0x9d, and 0x99 with 0xb1. this did the right thing by accident
> >     > for 동 but mangled other characters that contained any of those bytes.
> >     >
> >     > and although philosophically i'm usually on board with your "all times are ISO,
> >     > all text is UTF8", i'm really not sure it makes much sense to even *try* to
> >     > support this in tr. why? because i think it opens the i18n/l10n can of worms
> >     > again. if you think about non-binary uses of tr, they're often stuff like
> >     > "convert to all caps", but are we going to get that right for Turkish/Azeri
> >     > dotted/dotless 'i's, Greek final/non-final sigma, etc? are we going to have tr's
> >     > behavior then depend on your locale?
> >
> >     I've been handing that off to libc, which I'm aware bionic does not currently
> >     do,
> >
> > to be clear, *no* C library can get all of those examples right because the C
> > APIs are broken as designed. you genuinely need something like icu4c for this.
> > that's *why* bionic only supports the C locale.
>
> Sadly, this is one of the things I've wanted to "study until it makes sense and
> I can understand what the right thing to do is", and it's one where staring at
> it has made it worse.
>
> Yeah combining characters are out of scope for tr but I thought "map one unicode
> point to another unicode point" was a reasonable goal for tr? You're making a
> fairly strong case that it isn't. (I knew the unicode consortium was insane, but
> I'd assumed a base layer of competency that had gotten crapped upon by microsoft
> and IBM joining the committee, which I might be able to dig down to reach.
> Sounds like it's turtles all the way...)
>

to be fair to them, this one's not their fault --- this is those annoying
humans. (rule #1 of i18n --- for any "obviously universally true" thing you
can think of, there's a language for which it isn't true.)


> > but, yeah, in retrospect i shouldn't have led with an example that libc *can*
> > get right and followed up with one it can't. the other way round would
> > definitely have been better :-)
> >
> > but i can explain why i did that: we've had a ton of trouble with the
> > dotted/dotless 'i' in Java because you really have to think about whether you
> > want that behavior. and it's often quite non-obvious because it doesn't just
> > depend on your locale, it depends on your *input*. if i'm a Turk changing the
> > case of some Turkish text, yes, i want that locale-specific behavior. if i'm a
> > Turk who's just trying to build AOSP, i don't, because it's going to corrupt ASCII.
>
> Sigh. I think I need to bow to your expertise here. You at least have real-world
> experience with the mess via the bionic and dalvik library bug reports.
>
> Is this _just_ a problem with case mapping? (Can we say [:lower:] and [:upper:]
> are the relevant 26 ascii characters and _otherwise_ accept a unicode point
> input at each end as one character?)
>

depends what you mean...

yes, "case mapping is broken". for example, take "ΚολοΣΣὸΣ" (where the
initial kappa and the three sigmas are all uppercase, and i've highlighted
the letters to pay attention to). lowercased that becomes "κολοσσὸς"
because the lowercase of "Σ" is *either* "σ" *or* "ς" depending on where in
the word you are. this is why useful APIs for upper/lower case take a
*string* rather than a character.

tr's byte-by-byte/character-by-character model is just not a good fit for
language.
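
(to make the sigma problem concrete, here's a minimal python sketch. it assumes a recent CPython 3, whose str.lower() applies the final-sigma rule when it can see the whole string; the variable names are just for illustration.)

```python
# per-character lowercasing (tr's model) versus whole-string lowercasing.
word = "ΚολοΣΣὸΣ"

# tr-style: each character is mapped on its own, with no view of its neighbours.
per_char = "".join(ch.lower() for ch in word)

# string-style: str.lower() sees the whole word, so it can pick the
# word-final form of sigma for the last letter.
whole = word.lower()

print(per_char)
print(whole)
```

the first line printed ends in "σ" and the second in "ς", which is the difference a character-at-a-time tool can't see.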

but if you mean "could a tr implementation correctly replace 동 with 東?" (as
in the earlier Korean national anthem example), yes, it could. i'm just not
convinced that's a useful tool and thus worth the complexity/performance
hit of doing it "properly".
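
(a rough python sketch of the two models, using the string and the byte values from the hex dumps earlier in the thread. this is an illustration of the difference, not how any real tr is implemented, and the variable names are made up.)

```python
text = "동해 물과 백두산이"

# code-point model: one character in, one character out.
by_char = text.translate(str.maketrans("동", "東"))

# byte model: map each UTF-8 byte of 동 to the matching byte of 東
# (0xeb->0xe6, 0x8f->0x9d, 0x99->0xb1), wherever those bytes occur.
table = bytes.maketrans("동".encode("utf-8"), "東".encode("utf-8"))
by_byte = text.encode("utf-8").translate(table)

print(by_char)        # only the first character changes
print(by_byte.hex())  # same bytes as the mangled xxd dump, minus the newline
```

the code-point version leaves 물, 백 and 두 alone; the byte version rewrites their leading 0xeb too, which is exactly the mangling in the dump.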

(if you're _looking_ for something to do along these lines, i'd suggest -e
ENCODING for strings(1) instead. i've almost implemented it myself
on a couple of occasions, but since one was `-e S` and the other was `-e b`
[and i've never had a use for l/B/L] it hasn't hit my own "useful enough"
threshold yet.)
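
(for reference, a rough python sketch of what a `-e b` pass might look like, assuming it means 16-bit big-endian characters as in GNU strings and only matching printable ASCII. strings_utf16be and the sample blob are made up for illustration; this isn't toybox code.)

```python
import re

def strings_utf16be(data: bytes, min_len: int = 4):
    """Yield runs of at least min_len printable ASCII characters stored as
    16-bit big-endian units (a 0x00 byte followed by 0x20..0x7e)."""
    pattern = re.compile(rb"(?:\x00[\x20-\x7e]){%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("utf-16-be")

# "hello" is long enough to be reported, "ok" is below the 4 character minimum.
blob = (b"\x01\x02" + "hello".encode("utf-16-be")
        + b"\xff\xfe\xfd" + "ok".encode("utf-16-be"))
found = list(strings_utf16be(blob))
print(found)
```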


> > my claim here is that "most Turks^wpeople using tr(1) are programmers trying to
> > build something" :-)
> >
> >     you changed the main.c intro code to not pass along environmental utf8-ness,
> >     I changed it back and have another pending commit from you (which still isn't
> >     going to make a difference on bionic yet anyway... :)
> >
> > that's unrelated. that's some weird macOS-only thing about what they call
> > their locales.
>
> Your updated one is applied.
>

(thanks! after the chmod.test fix, AOSP is back up to date again for the
first time in longer than i'd realized.)


> >     > are we going to deal with combining
> >     > characters too, or do i have to specify all the ways you can write "ö" to get
> >     > "Freude, schöner Götterfunken" right (because without a hex dump, neither you
> >     > nor i know whether those two 'ö's were encoded the same way)?
> >
> >     No, I was just going to substitute individual utf8 sequences. There comes a
> >     point where you switch to "sed"...
> >
> > so what's the point? you can't get this right, and no-one cares anyway, but it
> > makes things slower and more complicated because... ? who are you, and what have
> > you done to the real rob? :-)
>
> Buried in shell logic, giving everything else combined like 1/4 voltage. (And
> shell has to care about this crap too because bash has ${PATH^^*} and friends.
> Wildcard matching has [:upper:] and variables have declare -u.)
>
> Did you know you can have exported local variables? I'm not sure who to argue
> with about that, but I expect it would require time travel to the mid-1980's. (I
> figured out how to make it work, I just object to NEEDING to at a
> conceptual level.)
>
> >     > amusingly, the Plan9 man page only gave one example of using tr(1), and it was
> >     > converting ASCII upper/lower. so i don't think they had any _use_ for it either,
> >     > they just wrote everything in terms of runes.
> >
> >     They were english-only white guys in the early 90's. Good intentions only go so
> >     far if it never leaves the lab and hits real world data.
> >
> > ...which is why the C API is broken for this, and -- unless you're going to
> > accept icu4c as a dependency -- these two well-intentioned white guys in the
> > early 2020's aren't going to fix it either.
> >
> > (personally i'd be fine with the icu4c dependency, though i still don't think
> > it's *useful* for tr(1).)
>
> If bash didn't have case mapping I'd say this whole can of worms is out of
> scope, but unfortunately bash has several instances of case mapping.
>
> You're right mistranslating upper/lower case is a failure, but not being able to
> tr hiragana to katakana seems like a failure too? (But then it still can't
> handle romaji because that's a 2 char to 1 char mapping. Hmmm...)
>
> I'm trying to figure out where the 80/20 line is. It's not hard to parse unicode
> points, and for 1/1 mapping lists it's reasonably straightforward what to do
> with them. It's this [:banana:] nonsense where it wants to map between leitmotif
> and color palette that's confusing: the relationship is not FIXED. (And why does
> it even HAVE [:print:]?)
>
> Part of the problem is I'm not a heavy tr user. It always struck me as one of
> those obsolete tools like "ed" left over from the daisy-wheel typesetting days
> ("man" is built on a stack of like 6 of them), and now that I'm trying to figure
> out what it's FOR... I find understanding this tool's use case profile highly
> non-obvious.
>

that was the case i was trying to make: i'm pretty sure that in 2020 it's
"for" replacing control characters/punctuation with other control
characters/punctuation. "-" -> "_" or " " -> "\n" or whatever.
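
(a tiny python sketch of that use case, with made-up sample input. it works byte by byte, and that's safe on UTF-8 text as long as both sets are plain ASCII, because ASCII bytes never occur inside a multibyte UTF-8 sequence.)

```python
# roughly what `tr '- ' '_\n'` does: map "-" to "_" and " " to newline.
table = bytes.maketrans(b"- ", b"_\n")

line = b"some-file-name other-name"
out = line.translate(table)
print(out)

# the same byte table leaves the Korean characters intact.
mixed = "동해-물과 백두산이".encode("utf-8").translate(table).decode("utf-8")
print(mixed)
```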


> >     Doesn't look too expensive? In musl time() is a clock_gettime(CLOCK_REALTIME)
> >     wrapper which lives in the vdso. In bionic you're doing
> >     reinterpret_cast<decltype(&time)>(__libc_globals->vdso[VDSO_TIME].fn); before
> >     making a call to gettimeofday() which ALSO lives in the vdso...?
> >
> > the gettimeofday() is the fallback for kernels whose VDSO doesn't support it.
> > (things are especially complicated if you're a 32-bit process on aarch64, for
> > example. it's all fixed as of 5.x but we have something like a 7 year rule too.)
>
> I.E. on older systems it would work but be slow. Which as failure modes go...
>

exactly. (though to be clear "older" in this context is pretty damn recent!)


> My point is that performance optimization is a can of worms that would take my
> full attention to do right, and I'm not there yet. In the meantime partial
> optimizations that change the output of watch and tee and such rub my nose in
> the incompleteness of my test suite which can't really see terminal output yet
> because I haven't done proper pty emulators. But then I haven't taught "watch"
> to parse ANSI color change sequences and jumps yet (and I first wrote code to
> parse the full DOS ansi.sys sequence set and write directly into VGA text mode
> memory at 0xa800000 or whatever it was in 1991).
>
> >     My original https://landley.net/toybox/design.html#goals ordering of "simple,
> >     small, fast, and full-featured" has shuffled around a bit already. Several
> >     commands have gotten a LOT more full featured, and CONFIG_SORT_SMALL went away a
> >     while ago. Needs of the users and all that. I still want simple first, but
> >     "simplest implementation of..." has "of" doing some heavy lifting these days.
> >
> > doing tr(1) *right* (while a laudable goal) is so hard (or at least "pulls in a
> > very large dependency") that i don't think it makes sense until/unless someone
> > actually needs it. and the fact that no other tr(1) does it right suggests
> > no-one does.
>
> That's hard to argue against.
>
> I'm still not quite sure "handles basic 1-1 unicode point mapping, doesn't care
> what each unicode point IS so treats combining characters as a series of unicode
> points, none of the [:splat:] macros expand to unicode characters" isn't a
> reasonable place to draw the line. But then I don't really know what combining
> characters DO and why they exist. "This can't work for everything, can it work
> for enough to be useful"... sigh. I dunno.
>
> That's why it's still in pending. "Only does ascii" wasn't so much a choice as a
> historical accident held in place by inertia. Accepting that as "what tr should
> do" seems sad, but I'm not qualified to improve upon it.
>

so "good enough for now, wait for counterexample"?
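
(on what combining characters do to a 1-1 code point mapping: a minimal python sketch, using the "ö" from the earlier schöner example. the two spellings and the mapping are made up for illustration.)

```python
import unicodedata

precomposed = "sch\u00f6ner"   # ö as one code point, U+00F6
decomposed = "scho\u0308ner"   # o followed by U+0308 COMBINING DIAERESIS

# a 1-1 code point mapping for U+00F6 only catches the precomposed spelling.
table = str.maketrans("\u00f6", "o")
hit = precomposed.translate(table)
miss = decomposed.translate(table)
print(hit)
print(miss == decomposed)

# normalizing to NFC first turns both spellings into the same sequence.
same = unicodedata.normalize("NFC", decomposed) == precomposed
print(same)
```

both strings render the same on screen, but only the first one is changed by the mapping unless the input is normalized first.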


> > (yes, i know that kind of logic can be easily abused, but i think
> > it's reasonable here --- it's not like GNU is averse to adding
> > features/reimplementing libc within their own tools, and they do support Unicode
> > just about everywhere else.) and even if you don't buy that, remember that
> > "fixed" is complicated here (well, in Turkey and Azerbaijan anyway).
>
> If you want to explicitly list the set of lowercase characters that turn into
> each corresponding uppercase character, with your entire alphabet spelled out in
> the from and to sections of tr, it SEEMS like tr should be able to handle that?
>

(the point of my Greek example is that you *can't* do that for all
languages, but...)


> There should _be_ a command line utility that can do that. Is it lex, maybe?
>

...yeah, lex could do the broken thing.

lex could also do the "really correct" thing: lex (to break into words)
calling icu4c on each word (to convert case).


> Anyway, not gonna get to promoting tr this week. I have opened that can of worms
> that is shell functions, and am scheduled to fly back to Tokyo on the 20th
> (covid willing) and should probably cut a toybox release before getting on the
> plane. I've got a bug heap to fix before then with date -I and so on in it...
>
> >     >     Rob
> >
> >     Rob
>
> Rob
>