<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Dec 26, 2020 at 11:09 AM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 12/5/20 1:32 PM, enh wrote:<br>

> (you'd expect to see '東해 물과 백두산이' if this actually worked: change from<br>

> the hangeul for "east" to the hanja for "east" but leave everything else alone.)<br>

> <br>

> xxd confirms it's just screwing up bytes:<br>

<br>

Oh I know the others are all doing bytes, but that doesn't make it the right<br>

thing to do. :)<br>

<br>

> ~$ echo '동해 물과 백두산이' |  xxd<br>

> 00000000: eb8f 99ed 95b4 20eb acbc eab3 bc20 ebb0  ...... ...... ..<br>

> 00000010: b1eb 9190 ec82 b0ec 9db4 0a              ...........<br>

> ~$ echo '동해 물과 백두산이' | tr '동' '東' | xxd<br>

> 00000000: e69d b1ed 95b4 20e6 acbc eab3 bc20 e6b0  ...... ...... ..<br>

> 00000010: b1e6 9190 ec82 b0ec 9db4 0a              ...........<br>

> <br>

> 동 is 0xeb 0x8f 0x99 and the equivalent Chinese character 東 is 0xe6 0x9d 0xb1,<br>

> and you can see from those hex dumps that what tr did was replace 0xeb with<br>

> 0xe6, 0x8f with 0x9d, and 0x99 with 0xb1. this did the right thing by accident<br>

> for 동 but mangled other characters that contained any of those bytes.<br>

> <br>
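For illustration, a minimal C sketch of the two behaviors under discussion: per-byte mapping (what the hex dumps show) versus replacing a whole UTF-8 sequence. The names tr_bytes() and tr_utf8_char() are made up for this example and are not toybox code. The second function assumes valid UTF-8 input and equal-length sequences, so it can edit in place.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise translation: every byte found in `from` is replaced by the
 * byte at the same index in `to`, regardless of character boundaries. */
void tr_bytes(char *s, const char *from, const char *to)
{
  size_t n = strlen(from);

  assert(strlen(to) == n);
  for (; *s; s++) {
    const char *hit = memchr(from, *s, n);

    if (hit) *s = to[hit - from];
  }
}

/* Sequence-wise substitution: replace `from` only where the whole encoded
 * character matches. Assumption: valid UTF-8, and from/to are the same
 * length in bytes (true for the 3-byte example in this thread). */
void tr_utf8_char(char *s, const char *from, const char *to)
{
  size_t n = strlen(from);

  assert(n > 0 && strlen(to) == n);
  while (*s) {
    if (!strncmp(s, from, n)) {
      memcpy(s, to, n);
      s += n;
    } else {
      /* skip one code point: lead byte, then 10xxxxxx continuation bytes */
      s++;
      while (((unsigned char)*s & 0xC0) == 0x80) s++;
    }
  }
}
```

Given the bytes of the sample line, with 0xeb 0x8f 0x99 as the source set and 0xe6 0x9d 0xb1 as the destination set, tr_bytes() reproduces the mangled second dump, while tr_utf8_char() changes only the first character.<br>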

> and although philosophically i'm usually on board with your "all times are ISO,<br>

> all text is UTF8", i'm really not sure it makes much sense to even *try* to<br>

> support this in tr. why? because i think it opens the i18n/l10n can of worms<br>

> again. if you think about non-binary uses of tr, they're often stuff like<br>

> "convert to all caps", but are we going to get that right for Turkish/Azeri<br>

> dotted/dotless 'i's, Greek final/non-final sigma, etc? are we going to have tr's<br>

> behavior then depend on your locale?<br>

<br>

I've been handing that off to libc, which I'm aware bionic does not currently<br>

do,</blockquote><div><br></div><div>to be clear, *no* C library can get all of those examples right because the C APIs are broken as designed. you genuinely need something like icu4c for this. that's *why* bionic only supports the C locale.</div><div><br></div><div>but, yeah, in retrospect i shouldn't have led with an example that libc *can* get right and followed up with one it can't. the other way round would definitely have been better :-)</div><div><br></div><div>but i can explain why i did that: we've had a ton of trouble with the dotted/dotless 'i' in Java because you really have to think about whether you want that behavior. and it's often quite non-obvious because it doesn't just depend on your locale, it depends on your *input*. if i'm a Turk changing the case of some Turkish text, yes, i want that locale-specific behavior. if i'm a Turk who's just trying to build AOSP, i don't, because it's going to corrupt ASCII.</div><div><br></div><div>my claim here is that "most Turks^wpeople using tr(1) are programmers trying to build something" :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">you changed the main.c intro code to not pass along environmental utf8-ness,<br>

I changed it back and have another pending commit from you (which still isn't<br>

going to make a difference on bionic yet anyway... :)<br></blockquote><div><br></div><div>that's unrelated. that's some weird macOS-only thing about what they call their locales.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> are we going to deal with combining<br>

> characters too, or do i have to specify all the ways you can write "ö" to get<br>

> "Freude, schöner Götterfunken" right (because without a hex dump, neither you<br>

> nor i know whether those two 'ö's were encoded the same way)?<br>

<br>
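To make the "ö" point concrete, a small C sketch showing that the precomposed and decomposed spellings are different byte sequences, so a substitution keyed on one never sees the other. The names O_PRECOMPOSED, O_DECOMPOSED and count_seq() are hypothetical. The byte values are the UTF-8 encodings of U+00F6, and of "o" followed by the combining diaeresis U+0308.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two spellings that normally render the same. */
static const char O_PRECOMPOSED[] = "\xc3\xb6";   /* U+00F6 */
static const char O_DECOMPOSED[]  = "o\xcc\x88";  /* U+006F U+0308 */

/* Count non-overlapping occurrences of the byte sequence `needle`. */
size_t count_seq(const char *hay, const char *needle)
{
  size_t n = strlen(needle), hits = 0;

  assert(n > 0);
  while ((hay = strstr(hay, needle))) {
    hits++;
    hay += n;
  }

  return hits;
}
```

A line that mixes both spellings reports one hit for each sequence, which is why a sequence-level tr would need both listed to catch every "ö".<br>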

No, I was just going to substitute individual utf8 sequences. There comes a<br>

point where you switch to "sed"...<br></blockquote><div><br></div><div>so what's the point? you can't get this right, and no-one cares anyway, but it makes things slower and more complicated because... ? who are you, and what have you done to the real rob? :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> amusingly, the Plan9 man page only gave one example of using tr(1), and it was<br>

> converting ASCII upper/lower. so i don't think they had any _use_ for it either,<br>

> they just wrote everything in terms of runes.<br>

<br>

They were English-only white guys in the early 90's. Good intentions only go so<br>

far if it never leaves the lab and hits real world data.<br></blockquote><div><br></div><div>...which is why the C API is broken for this, and -- unless you're going to accept icu4c as a dependency -- these two well-intentioned white guys in the early 2020's aren't going to fix it either.</div><div><br></div><div>(personally i'd be fine with the icu4c dependency, though i still don't think it's *useful* for tr(1).)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> personally i'd s/characters/bytes/ in the docs and call it done. we can "fix" it<br>

> if/when anyone has an actual practical need for it.<br>

<br>

It's still in pending because I haven't decided what to do yet, but most of that<br>

is I have so many OTHER todo items...<br>

<br>

>     (Also, line buffering sucks because it'll flush at the buffer size anyway so<br>

>     you're not guaranteed to get a full line of contiguous output. What it REALLY<br>

>     wants is Nagle's algorithm for stdout but no libc ever bothered to IMPLEMENT it,<br>

>     possibly because of the runtime expense... Ahem. My point is commands should<br>

>     probably do sane output blocking on their own.)<br>

> <br>

> <br>

> iirc line buffering was the compromise we arrived at that neither of us really<br>

> likes, because i just want full buffering (because although i know the problem<br>

> you're trying to solve, i've lived with it for decades without actually feeling<br>

> like it's ever hurt me, and just consider it "working as intended") while you<br>

> want a kind of buffering that doesn't actually exist (and would be non-trivial<br>

> to make exist).<br>

> <br>

> and sadly my list of "anything that can be used in a pipeline during a build"<br>

> and your list of "anything that can be used in a pipeline interactively" [aiui]<br>

> has a lot of overlap.<br>

<br>

It is seasonally appropriate to sing "nagle nagle nagle, made from<br>

gettimeofday()" to the dreidel song:<br>

<br>

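  time_t last_out = 0; // lives in TT: second of the last unflushed output, 0 if none<br>

<br>

  char *xgetline(void)<br>

  {<br>

    struct pollfd blah = {0, POLLIN};<br>

    char *line = 0;<br>

    size_t len = 0;<br>

<br>

    // flush if the second rolled over, or if no input shows up within 250ms<br>

    if (TT.last_out && (time(0) != TT.last_out || !poll(&blah, 1, 250)))<br>

      xflush(), TT.last_out = 0;<br>

<br>

    if (getline(&line, &len, stdin) < 0) free(line), line = 0;<br>

<br>

    return line;<br>

  }<br>

<br>

  int xprintf(char *pat, ...)<br>

  {<br>

    va_list va;<br>

    int ret;<br>

<br>

    TT.last_out = time(0);<br>

    va_start(va, pat);<br>

    ret = vprintf(pat, va);<br>

    va_end(va);<br>

<br>

    return ret;<br>

  }<br>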

<br>

Doesn't look too expensive? In musl time() is a clock_gettime(CLOCK_REALTIME)<br>

wrapper which lives in the vdso. In bionic you're doing<br>

reinterpret_cast<decltype(&time)>(__libc_globals->vdso[VDSO_TIME].fn); before<br>

making a call to gettimeofday() which ALSO lives in the vdso...?<br></blockquote><div><br></div><div>the gettimeofday() is the fallback for kernels whose VDSO doesn't support it. (things are especially complicated if you're a 32-bit process on aarch64, for example. it's all fixed as of 5.x but we have something like a 7 year rule too.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Anyway, that's vaguely what I had in mind: implement cheap time-based flushing<br>

by hooking into the existing xgetline() and xprintf() stuff and relying on<br>

checking the time to be in vdso and only calling the extra poll() when<br>

there's (feasibly) pending output, and that once/second.<br>

<br>

It shouldn't be an unsolvable problem, just not what I've been focusing on...<br>

<br>

> speaking of buffering, one argument i've been avoiding for years is that<br>

> toybuf should be (say) 64KiB rather than just 4KiB. no-one's been asking me for<br>

> anything more than "in the same ballpark" performance, but in many of the cases<br>

> i've looked at, our remaining delta relative to GNU is our much smaller buffer.<br>

> (though as you'd expect, judging from straces over the years, they seem to be<br>

> wildly inconsistent there, and i've seen everything from 8KiB for something like<br>

> tr(1) to 128KiB for cat(1).)<br>

Expanding toybuf makes nommu sad, but we can xmalloc() a bigger buffer if we<br>

need to. I've mentally had "performance tweaks" as post-1.0 todo items, but<br>

there's a userbase now...<br>

<br>
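A rough sketch of the "allocate a bigger buffer" option, assuming plain malloc()/read()/write() rather than toybox's own wrappers. copy_fd() and its bufsize parameter are hypothetical names. The buffer size is a caller argument so that 4KiB and 64KiB can be compared on the same loop.<br>

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy everything from `in` to `out` through a heap buffer of `bufsize`
 * bytes. Returns the number of bytes copied, or -1 on error. */
long long copy_fd(int in, int out, size_t bufsize)
{
  char *buf;
  long long total = 0;
  ssize_t got;

  assert(bufsize > 0);
  if (!(buf = malloc(bufsize))) return -1;

  for (;;) {
    got = read(in, buf, bufsize);
    if (got < 0 && errno == EINTR) continue;
    if (got <= 0) break;  /* EOF or read error */

    /* write() may be short, so loop until the whole chunk is out */
    for (ssize_t off = 0; off < got; ) {
      ssize_t put = write(out, buf + off, got - off);

      if (put < 0) {
        if (errno == EINTR) continue;
        got = -1;
        break;
      }
      off += put;
    }
    if (got < 0) break;  /* write error */
    total += got;
  }
  free(buf);

  return got < 0 ? -1 : total;
}
```

Something like copy_fd(0, 1, 65536) versus copy_fd(0, 1, 4096) under strace would show the difference in syscall count without touching the size of toybuf itself.<br>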

My original <a href="https://landley.net/toybox/design.html#goals" rel="noreferrer" target="_blank">https://landley.net/toybox/design.html#goals</a> ordering of "simple,<br>

small, fast, and full-featured" has shuffled around a bit already. Several<br>

commands have gotten a LOT more full featured, and CONFIG_SORT_SMALL went away a<br>

while ago. Needs of the users and all that. I still want simple first, but<br>

"simplest implementation of..." has "of" doing some heavy lifting these days.<br></blockquote><div><br></div><div>doing tr(1) *right* (while a laudable goal) is so hard (or at least "pulls in a very large dependency") that i don't think it makes sense until/unless someone actually needs it. and the fact that no other tr(1) does it right suggests no-one does. (yes, i know that kind of logic can be easily abused, but i think it's reasonable here --- it's not like GNU is averse to adding features/reimplementing libc within their own tools, and they do support Unicode just about everywhere else.) and even if you don't buy that, remember that "fixed" is complicated here (well, in Turkey and Azerbaijan anyway).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

>     Rob<br>

<br>

Rob<br>

</blockquote></div></div>