[Toybox] [PATCH] tr: fix pathological flushing.

enh enh at google.com
Tue Jan 5 17:17:03 PST 2021


On Sat, Dec 26, 2020 at 11:09 AM Rob Landley <rob at landley.net> wrote:

> On 12/5/20 1:32 PM, enh wrote:
> > (you'd expect to see '東해 물과 백두산이' if this actually worked: change from
> > the hangeul for "east" to the hanja for "east" but leave everything else
> alone.)
> >
> > xxd confirms it's just screwing up bytes:
>
> Oh I know the others are all doing bytes, but that doesn't make it the
> right
> thing to do. :)
>
> > ~$ echo '동해 물과 백두산이' |  xxd
> > 00000000: eb8f 99ed 95b4 20eb acbc eab3 bc20 ebb0  ...... ...... ..
> > 00000010: b1eb 9190 ec82 b0ec 9db4 0a              ...........
> > ~$ echo '동해 물과 백두산이' | tr '동' '東' | xxd
> > 00000000: e69d b1ed 95b4 20e6 acbc eab3 bc20 e6b0  ...... ...... ..
> > 00000010: b1e6 9190 ec82 b0ec 9db4 0a              ...........
> >
> > 동 is 0xeb 0x8f 0x99 and the equivalent Chinese character 東 is 0xe6 0x9d
> 0xb1,
> > and you can see from those hex dumps that what tr did was replace 0xeb
> with
> > 0xe6, 0x8f with 0x9d, and 0x99 with 0xb1. this did the right thing by
> accident
> > for 동 but mangled other characters that contained any of those bytes.
> >
> > and although philosophically i'm usually on board with your "all times
> are ISO,
> > all text is UTF8", i'm really not sure it makes much sense to even *try*
> to
> > support this in tr. why? because i think it opens the i18n/l11n can of
> worms
> > again. if you think about non-binary uses of tr, they're often stuff like
> > "convert to all caps", but are we going to get that right for
> Turkish/Azeri
> > dotted/dotless 'i's, Greek final/non-final sigma, etc? are we going to
> have tr's
> > behavior then depend on your locale?
>
> I've been handing that off to libc, which I'm aware bionic does not
> currently
> do,


to be clear, *no* C library can get all of those examples right because the
C APIs are broken as designed. you genuinely need something like icu4c for
this. that's *why* bionic only supports the C locale.

but, yeah, in retrospect i shouldn't have led with an example that libc
*can* get right and followed up with one it can't. the other way round
would definitely have been better :-)

but i can explain why i did that: we've had a ton of trouble with the
dotted/dotless 'i' in Java because you really have to think about whether
you want that behavior. and it's often quite non-obvious because it doesn't
just depend on your locale, it depends on your *input*. if i'm a Turk
changing the case of some Turkish text, yes, i want that locale-specific
behavior. if i'm a Turk who's just trying to build AOSP, i don't, because
it's going to corrupt ASCII.

my claim here is that "most Turks^wpeople using tr(1) are programmers
trying to build something" :-)


> you changed the main.c intro code to not pass along environmental
> utf8-ness,
> I changed it back and have another pending commit from you (which still
> isn't
> going to make a difference on bionic yet anyway... :)
>

that's unrelated. that's some weird macOS-only thing about what they call
their locales.


> > are we going to deal with combining
> > characters too, or do i have to specify all the ways you can write "ö"
> to get
> > "Freude, schöner Götterfunken" right (because without a hex dump,
> neither you
> > nor i know whether those two 'ö's were encoded the same way)?
>
> No, I was just going to substitute individual utf8 sequences. There comes a
> point where you switch to "sed"...
>

so what's the point? you can't get this right, and no-one cares anyway, but
it makes things slower and more complicated because... ? who are you, and
what have you done to the real rob? :-)


> > amusingly, the Plan9 man page only gave one example of using tr(1), and
> it was
> > converting ASCII upper/lower. so i don't think they had any _use_ for it
> either,
> > they just wrote everything in terms of runes.
>
> They were english-only white guys in the early 90's. Good intentions only
> go so
> far if it never leaves the lab and hits real world data.
>

...which is why the C API is broken for this, and -- unless you're going to
accept icu4c as a dependency -- these two well-intentioned white guys in
the early 2020's aren't going to fix it either.

(personally i'd be fine with the icu4c dependency, though i still don't
think it's *useful* for tr(1).)


> > personally i'd s/characters/bytes/ in the docs and call it done. we can
> "fix" it
> > if/when anyone has an actual practical need for it.
>
> It's still in pending because I haven't decided what to do yet, but most
> of that
> is I have so many OTHER todo items...
>
> >     (Also, line buffering sucks because it'll flush at the buffer size
> anyway so
> >     you're not guaranteed to get a full line of contiguous output. What
> it REALLY
> >     wants is nagle's algorithm for stdout but no libc ever bothered to
> IMPLEMENT it,
> >     possibly because of the runtime expense... Ahem. My point is
> commands should
> >     probably do sane output blocking on their own.)
> >
> >
> > iirc line buffering was the compromise we arrived at that neither of us
> really
> > likes, because i just want full buffering (because although i know the
> problem
> > you're trying to solve, i've lived with it for decades without actually
> feeling
> > like it's ever hurt me, and just consider it "working as intended")
> while you
> > want a kind of buffering that doesn't actually exist (and would be
> non-trivial
> > to make exist).
> >
> > and sadly my list of "anything that can be used in a pipeline during a
> build"
> > and your list of "anything that can be used in a pipeline interactively"
> [aiui]
> > has a lot of overlap.
>
> It is seasonally appropriate to sing "nagle nagle nagle, made from
> gettimeofday()" to the dreidel song:
>
>   time_t last_out = 0;
>
>   char *xgetline(void)
>   {
>     struct pollfd blah = {0, POLLIN};
>
>     if (TT.last_out && (time()!=TT.last_out || !poll(&blah, 250))
>       xflush(), TT.last_out = 0;
>
>     getline();
>   }
>
>   int xprintf(char *pat, ...)
>   {
>     TT.last_out = time();
>
>     printf();
>   }
>
> Doesn't look too expensive? In musl time() is a
> clock_gettime(CLOCK_REALTIME)
> wrapper which lives in the vdso. In bionic you're doing
> reinterpret_cast<decltype(&time)>(__libc_globals->vdso[VDSO_TIME].fn);
> before
> making a call to gettimeofday() which ALSO lives in the vdso...?
>

the gettimeofday() is the fallback for kernels whose VDSO doesn't support
it. (things are especially complicated if you're a 32-bit process on
aarch64, for example. it's all fixed as of 5.x but we have something like a
7 year rule too.)


> Anyway, that's vaguely what I had in mind: implement cheap time-based
> flushing
> by hooking into the existing xgetline() and xprintf() stuff and relying on
> checking the time to be in vdso and only only calling the extra poll() when
> there's (feasibly) pending output, and that once/second.
>
> It shouldn't be an unsolvable problem, just not what I've been focusing
> on...
>
> > speaking of buffering, one argument i've been avoiding for years is that
> > toybuf should be (say) 64KiB rather than just 4KiB. no-one's been asking
> me for
> > anything more than "in the same ballpark" performance, but in many of
> the cases
> > i've looked at, our remaining delta relative to GNU is our much smaller
> buffer.
> > (though as you'd expect, judging from straces over the years, they seem
> to be
> > wildly inconsistent there, and i've seen everything from 8KiB for
> something like
> > tr(1) to 128KiB for cat(1).)
> Expanding toybuf makes nommu sad, but we can xmalloc() a bigger buffer if
> we
> need to. I've mentally had "performance tweaks" as post-1.0 todo items, but
> there's a userbase now...
>
> My original https://landley.net/toybox/design.html#goals ordering of
> "simple,
> small, fast, and full-featured" has shuffled around a bit already. Several
> commands have gotten a LOT more full featured, and CONFIG_SORT_SMALL went
> away a
> while ago. Needs of the users and all that. I still want simple first, but
> "simplest implementation of..." has "of" doing some heavy lifting these
> days.
>

doing tr(1) *right* (while a laudable goal) is so hard (or at least "pulls
in a very large dependency") that i don't think it makes sense until/unless
someone actually needs it. and the fact that no other tr(1) does it right
suggests no-one does. (yes, i know that kind of logic can be easily abused,
but i think it's reasonable here --- it's not like GNU is averse to adding
features/reimplementing libc within their own tools, and they do support
Unicode just about everywhere else.) and even if you don't buy that,
remember that "fixed" is complicated here (well, in Turkey and Azerbaijan
anyway).


> >     Rob
>
> Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20210105/ecf4a863/attachment.htm>


More information about the Toybox mailing list