[Toybox] [PATCH] tr: fix pathological flushing.

Rob Landley rob at landley.net
Sat Dec 26 11:21:52 PST 2020


On 12/5/20 1:32 PM, enh wrote:
> (you'd expect to see '東해 물과 백두산이' if this actually worked: change from
> the hangeul for "east" to the hanja for "east" but leave everything else alone.)
> 
> xxd confirms it's just screwing up bytes:

Oh I know the others are all doing bytes, but that doesn't make it the right
thing to do. :)

> ~$ echo '동해 물과 백두산이' |  xxd
> 00000000: eb8f 99ed 95b4 20eb acbc eab3 bc20 ebb0  ...... ...... ..
> 00000010: b1eb 9190 ec82 b0ec 9db4 0a              ...........
> ~$ echo '동해 물과 백두산이' | tr '동' '東' | xxd
> 00000000: e69d b1ed 95b4 20e6 acbc eab3 bc20 e6b0  ...... ...... ..
> 00000010: b1e6 9190 ec82 b0ec 9db4 0a              ...........
> 
> 동 is 0xeb 0x8f 0x99 and the equivalent Chinese character 東 is 0xe6 0x9d 0xb1,
> and you can see from those hex dumps that what tr did was replace 0xeb with
> 0xe6, 0x8f with 0x9d, and 0x99 with 0xb1. this did the right thing by accident
> for 동 but mangled other characters that contained any of those bytes.
> 
> and although philosophically i'm usually on board with your "all times are ISO,
> all text is UTF8", i'm really not sure it makes much sense to even *try* to
> support this in tr. why? because i think it opens the i18n/l11n can of worms
> again. if you think about non-binary uses of tr, they're often stuff like
> "convert to all caps", but are we going to get that right for Turkish/Azeri
> dotted/dotless 'i's, Greek final/non-final sigma, etc? are we going to have tr's
> behavior then depend on your locale?

I've been handing that off to libc, which I'm aware bionic does not currently
do, you changed the main.c intro code to not pass along environmental utf8-ness,
I changed it back and have another pending commit from you (which still isn't
going to make a difference on bionic yet anyway... :)

> are we going to deal with combining
> characters too, or do i have to specify all the ways you can write "ö" to get
> "Freude, schöner Götterfunken" right (because without a hex dump, neither you
> nor i know whether those two 'ö's were encoded the same way)?

No, I was just going to substitute individual utf8 sequences. There comes a
point where you switch to "sed"...

> amusingly, the Plan9 man page only gave one example of using tr(1), and it was
> converting ASCII upper/lower. so i don't think they had any _use_ for it either,
> they just wrote everything in terms of runes.

They were english-only white guys in the early 90's. Good intentions only go so
far if it never leaves the lab and hits real world data.

> personally i'd s/characters/bytes/ in the docs and call it done. we can "fix" it
> if/when anyone has an actual practical need for it.

It's still in pending because I haven't decided what to do yet, but most of that
is I have so many OTHER todo items...

>     (Also, line buffering sucks because it'll flush at the buffer size anyway so
>     you're not guaranteed to get a full line of contiguous output. What it REALLY
>     wants is nagle's algorithm for stdout but no libc ever bothered to IMPLEMENT it,
>     possibly because of the runtime expense... Ahem. My point is commands should
>     probably do sane output blocking on their own.)
> 
> 
> iirc line buffering was the compromise we arrived at that neither of us really
> likes, because i just want full buffering (because although i know the problem
> you're trying to solve, i've lived with it for decades without actually feeling
> like it's ever hurt me, and just consider it "working as intended") while you
> want a kind of buffering that doesn't actually exist (and would be non-trivial
> to make exist).
> 
> and sadly my list of "anything that can be used in a pipeline during a build"
> and your list of "anything that can be used in a pipeline interactively" [aiui]
> has a lot of overlap.

It is seasonally appropriate to sing "nagle nagle nagle, made from
gettimeofday()" to the dreidel song:

  time_t last_out = 0;

  char *xgetline(void)
  {
    struct pollfd blah = {0, POLLIN};

    if (TT.last_out && (time()!=TT.last_out || !poll(&blah, 250))
      xflush(), TT.last_out = 0;

    getline();
  }

  int xprintf(char *pat, ...)
  {
    TT.last_out = time();

    printf();
  }

Doesn't look too expensive? In musl time() is a clock_gettime(CLOCK_REALTIME)
wrapper which lives in the vdso. In bionic you're doing
reinterpret_cast<decltype(&time)>(__libc_globals->vdso[VDSO_TIME].fn); before
making a call to gettimeofday() which ALSO lives in the vdso...?

Anyway, that's vaguely what I had in mind: implement cheap time-based flushing
by hooking into the existing xgetline() and xprintf() stuff and relying on
checking the time to be in vdso and only only calling the extra poll() when
there's (feasibly) pending output, and that once/second.

It shouldn't be an unsolvable problem, just not what I've been focusing on...

> speaking of buffering, one argument i've been avoiding for years is that
> toybuf should be (say) 64KiB rather than just 4KiB. no-one's been asking me for
> anything more than "in the same ballpark" performance, but in many of the cases
> i've looked at, our remaining delta relative to GNU is our much smaller buffer.
> (though as you'd expect, judging from straces over the years, they seem to be
> wildly inconsistent there, and i've seen everything from 8KiB for something like
> tr(1) to 128KiB for cat(1).)
Expanding toybuf makes nommu sad, but we can xmalloc() a bigger buffer if we
need to. I've mentally had "performance tweaks" as post-1.0 todo items, but
there's a userbase now...

My original https://landley.net/toybox/design.html#goals ordering of "simple,
small, fast, and full-featured" has shuffled around a bit already. Several
commands have gotten a LOT more full featured, and CONFIG_SORT_SMALL went away a
while ago. Needs of the users and all that. I still want simple first, but
"simplest implementation of..." has "of" doing some heavy lifting these days.

>     Rob

Rob



More information about the Toybox mailing list