<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Dec 5, 2020 at 3:38 AM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 12/4/20 1:58 PM, enh via Toybox wrote:<br>
> The AOSP build doesn't use tr (or anything that's still in pending), but<br>
> the kernel folks have been more aggressive. They found that tr's<br>
> pathological flushing was adding minutes to their build times.<br>
<br>
+ while ((n = read(0, toybuf, sizeof(toybuf)))) {<br>
+ if (!FLAG(d) && !FLAG(s)) {<br>
+ for (dst = 0; dst < n; dst++) toybuf[dst] = TT.map[toybuf[dst]];<br>
<br>
And when read returns -1 what happens?<br></blockquote><div><br></div><div>enh@ realizes he thought he'd changed this to xread and sends a patch, that's what :-)</div><div><br></div><div>(patch sent.)</div><div><br></div><div>it's a pity there's no /dev/eio or whatever to test these cases explicitly.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Sounds like time for me to cleanup and promote this command. (Let's see, my todo<br>
entry for tr is: "TODO: -t (truncate) -a (ascii)" and I do NOT remember what -a<br>
is but I think I was going to make tr support utf8? Which required a redesign.<br>
Possibly it should be -U to support utf8 instead but everything ELSE supports<br>
utf8 by default, because planet. Hmmm...)<br></blockquote><div><br></div><div>aye, but i don't think anyone makes you _pay_ for unicode (if you ignore Plan9 and Inferno, both of which did, but you'd have to go a long way to find anyone else who's ever used either of those).</div><div><br></div><div>but more than that i'm not sure anyone else _has_ done this (if you ignore Plan9 and Inferno, both of which did, but you'd have to go a long way to find anyone else who's ever used either of those):</div><div><br></div><div>~$ echo '동해 물과 백두산이' | tr '동' '東' <br>東해 欼과 氱摐산이<br>~$ echo '동해 물과 백두산이' | busybox tr '동' '東' <br>東해 欼과 氱摐산이<br>~$ <br></div><div><br></div><div>(you'd expect to see '東해 물과 백두산이' if this actually worked: change from the hangeul for "east" to the hanja for "east" but leave everything else alone.)</div><div><br></div><div>xxd confirms it's just screwing up bytes:</div><div><br></div><div><div>~$ echo '동해 물과 백두산이' | xxd<br>00000000: eb8f 99ed 95b4 20eb acbc eab3 bc20 ebb0 ...... ...... ..<br>00000010: b1eb 9190 ec82 b0ec 9db4 0a ...........<br></div><div></div></div><div>~$ echo '동해 물과 백두산이' | tr '동' '東' | xxd<br>00000000: e69d b1ed 95b4 20e6 acbc eab3 bc20 e6b0 ...... ...... ..<br>00000010: b1e6 9190 ec82 b0ec 9db4 0a ...........<br></div><div><br></div><div>동 is 0xeb 0x8f 0x99 and the equivalent Chinese character 東 is 0xe6 0x9d 0xb1, and you can see from those hex dumps that what tr did was replace 0xeb with 0xe6, 0x8f with 0x9d, and 0x99 with 0xb1. this did the right thing by accident for 동 but mangled other characters that contained any of those bytes.<br></div><div><br></div><div>and although philosophically i'm usually on board with your "all times are ISO, all text is UTF8", i'm really not sure it makes much sense to even *try* to support this in tr. why? because i think it opens the i18n/l11n can of worms again. if you think about non-binary uses of tr, they're often stuff like "convert to all caps", but are we going to get that right for Turkish/Azeri dotted/dotless 'i's, Greek final/non-final sigma, etc? are we going to have tr's behavior then depend on your locale? are we going to deal with combining characters too, or do i have to specify all the ways you can write "ö" to get "Freude, schöner Götterfunken" right (because without a hex dump, neither you nor i know whether those two 'ö's were encoded the same way)?</div><div><br></div><div>amusingly, the Plan9 man page only gave one example of using tr(1), and it was converting ASCII upper/lower. so i don't think they had any _use_ for it either, they just wrote everything in terms of runes.</div><div><br></div><div>personally i'd s/characters/bytes/ in the docs and call it done. we can "fix" it if/when anyone has an actual practical need for it.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> Just removing the fflush() made tr significantly faster for my trivial<br>
> test, but still slow, with all the time going into stdio.<br>
<br>
Single byte writes suck no matter how you slice 'em.<br>
<br>
> Rewriting the<br>
> loop to modify toybuf in place and then do one write per read made most<br>
> of the difference, but special-casing the "neither -d nor -s" case made<br>
> a measurable difference too on a Xeon.<br>
<br>
Sigh. I have this patch in my tree, which I haven't applied yet because I don't<br>
have the regression test setup to see what it would slow down in the AOSP build:<br>
<br>
--- a/main.c<br>
+++ b/main.c<br>
@@ -103,7 +103,7 @@ void toy_singleinit(struct toy_list *which, char *argv[])<br>
// that choose non-UTF-8 locales. macOS doesn't support C.UTF-8 though.<br>
if (!setlocale(LC_CTYPE, "C.UTF-8")) setlocale(LC_CTYPE, "");<br>
}<br>
- setlinebuf(stdout);<br>
+ setvbuf(stdout, 0, (which->flags & TOYFLAG_LINEBUF) ? _IOLBF : _IONBF, 0);<br>
}<br>
}<br>
<br>
--- a/lib/toyflags.h<br>
+++ b/lib/toyflags.h<br>
@@ -32,6 +32,9 @@<br>
// Suppress default --help processing<br>
#define TOYFLAG_NOHELP (1<<10)<br>
<br>
+// Line buffered stdout<br>
+#define TOYFLAG_LINEBUF (1<<11)<br>
+<br>
// Error code to return if argument parsing fails (default 1)<br>
#define TOYFLAG_ARGFAIL(x) (x<<24)<br>
<br>
--- a/toys/posix/grep.c<br>
+++ b/toys/posix/grep.c<br>
@@ -10,9 +10,9 @@<br>
* echo hello | grep -f </dev/null<br>
*<br>
<br>
-USE_GREP(NEWTOY(grep,<br>
"(line-buffered)(color):;(exclude-dir)*S(exclude)*M(include)*ZzEFHIab(byte-offset)h(no-filename)ino(only-matching)rRsvwcl(files-with-matches)q(quiet)(silent)e*f*C#B#A#m#x[!wx][!EFw]",<br>
TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)))<br>
-USE_EGREP(OLDTOY(egrep, grep, TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)))<br>
-USE_FGREP(OLDTOY(fgrep, grep, TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)))<br>
+USE_GREP(NEWTOY(grep,<br>
"(line-buffered)(color):;(exclude-dir)*S(exclude)*M(include)*ZzEFHIab(byte-offset)h(no-filename)ino(only-matching)rRsvwcl(files-with-matches)q(quiet)(silent)e*f*C#B#A#m#x[!wx][!EFw]",<br>
TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)|TOYFLAG_LINEBUF))<br>
+USE_EGREP(OLDTOY(egrep, grep, TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)|TOYFLAG_LINEBUF))<br>
+USE_FGREP(OLDTOY(fgrep, grep, TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)|TOYFLAG_LINEBUF))<br>
<br>
config GREP<br>
bool "grep"<br>
<br>
I started down that path because changing everything to line buffered by default<br>
is why we had to add an extra flush to passwd() and broke "watch". (The second<br>
is cosmetic breakage, but still...)<br>
<br>
I would very much LIKE to move in the direction of knowing what commands<br>
actually need (having line buffered be something we switch on in commands rather<br>
than switch on everywhere by default), but I don't want to surprise you by<br>
pulling the trigger on that one without some sort of coordination.<br>
<br>
(Also, line buffering sucks because it'll flush at the buffer size anyway so<br>
you're not guaranteed to get a full line of contiguous output. What it REALLY<br>
wants is nagle's algorithm for stdout but no libc ever bothered to IMPLEMENT it,<br>
possibly because of the runtime expense... Ahem. My point is commands should<br>
probably do sane output blocking on their own.)<br></blockquote><div><br></div><div>iirc line buffering was the compromise we arrived at that neither of us really likes, because i just want full buffering (because although i know the problem you're trying to solve, i've lived with it for decades without actually feeling like it's ever hurt me, and just consider it "working as intended") while you want a kind of buffering that doesn't actually exist (and would be non-trivial to make exist).</div><div><br></div><div>and sadly my list of "anything that can be used in a pipeline during a build" and your list of "anything that can be used in a pipeline interactively" [aiui] has a lot of overlap.</div><div><br></div><div>speaking of buffering, one argument i've been avoiding for years is that toybuf should be (say) 64KiB rather than just 4KiB. no-one's been asking me for anything more than "in the same ballpark" performance, but in many of the cases i've looked at, our remaining delta relative to GNU is our much smaller buffer. (though as you'd expect, judging from straces over the years, they seem to be wildly inconsistent there, and i've seen everything from 8KiB for something like tr(1) to 128KiB for cat(1).)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Rob<br>
</blockquote></div></div>