[Toybox] [PATCH] count.c: Human readable -h option and MAYFORK

Thu Oct 19 23:05:56 PDT 2023

On 10/19/23 13:20, enh wrote:
>> > if you had
>> > ```
>> > if (which->flags & TOYFLAG_LINEBUF) setvbuf(stdout, 0, _IOLBF, 0);
>> > ```
>> > we'd argue a lot less about buffering :-)
>>
>> The existing LINEBUF users are ascii, base64, base32, yes, echo, grep, egrep, fgrep.
> 
> yeah, i think we're in agreement about line buffering being reasonable for grep.
> 
> i'm just unconvinced that _no_ buffering is the right default...

I'm unconvinced there IS a right default.

I was originally leaving it to libc to provide a default, but A) it was
inconsistent between different libc's and even release versions, B) upgrades
broke "less" on more than one occasion, not just for toybox but within distros
I've used. (As in I've seen Red Hat break this and Ubuntu break this over the
years, because they trusted libc's buffering to do something predictable.)

I'm going to have to audit all the commands to categorize "produces progressive
output" and "does not produce progressive output" (which is sadly kind of HARD
because... wget? I mean... it COULD...) and then subset those into "writes to a
file descriptor" (which doesn't let it off the hook) and "writes through stdio",
and for each figure out what the performance looks like vs other implementations
and whether I can/should easily improve it.

Mostly, I wait for somebody to complain. But since the android build has already
had more than one performance bottleneck I've already had to tweak toybox for,
"performance" is bubbling up higher on my todo list. The gestalt of existing
complaints has accumulated some momentum.

>> Both ascii and echo pretty much want a single big output buffer, they do not
>> produce progressive output so it might as well all be collated into one big
>> output buffer and then flushed.
>>
>> I just rewrote yes not to use stdio at all, it does writev() to fd 1.
> 
> ...not least because you end up doing stuff like this.

Because the one _with_ the line buffering was two orders of magnitude slower
than debian's, and writev() lets me multiply the output-per-syscall without
allocating an unbounded amount of extra memory for copies.

Right now 30% slower doesn't matter and 50% slower is only if somebody
_specifically_ complains, but 10x slower? Or more? Harder to ignore.

> i worry that this might cause regressions too (based on the commit
> message from the change that made yes line-buffered):

Debian's is doing 3g/s output on my laptop which means it's using large
transaction sizes. Let's strace it:

write(1, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8192) = 8192
write(1, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8192) = 8192
write(1, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8192) = 8192
write(1, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8192) = 8192

Outputting more than one y\n per transaction can't break stuff or the debian one
would have triggered it.

> 1. 
> 
> (though i will accept the argument that "that's a toybox xargs test
> bug", but (a) we'll still need to fix that test

Any idea off the top of your head which of the 31 tests it was? (I can use the
git commit time to go dig back through the web archive and my blog and such, but
not right now. I added a TODO note to my tests/xargs.test file to remind me.)

> and (b) if we're going
> to go that way, did you measure what just using the trivial yes
> implementation and default stdio buffering gets you?)

$ cat yas.c
#include <stdio.h>

int main(int argc, char *argv[])
{
  for (;;) printf("y\n");
}
$ gcc yas.c
$ timeout 1 ./a.out | ./count -l >/dev/null

It wrote 99m. The iovec one is writing 2.7G on the same hardware.

Oddly enough your transaction size can be TOO big and slow down a little from
the peak. I haven't looked at the kernel pipe implementation to see when it
zerocopies provided buffers and when it copies them to chop them up, mostly
because that really really really really should be the kernel guys' problem, and
is almost guaranteed to be different in 5 years, and probably varies by hardware
anyway depending on how expensive page table fiddling is and whether "can I just
block the producer by not returning from the write() until the consumer's
accepted the zerocopy data out of the pipe buffer" is an acceptable strategy or
what...

I am thus NOT attempting to calculate an optimal iovec[] length based on the
provided command line string data, I'm just saying I was aware of the option.
(But that way lies madness and gnu.)

Rob

P.S. Count's X/s time output is actually X/period and the period is 250ms, but
it's also zeroing new periods and advancing into the zeroed period immediately
in the calculation as a full period when we've only been there for less than a
full 250ms, and what it SHOULD do is look at the remainder of millitime()%250
and use that to add a fraction of the oldest period (not otherwise included, and
not included as a period in the divisor) so "however much data the new one
accumulated in less than a full period of time" plus "the reciprocal portion of
the old period we left" should equal a coherent count.

P.P.S. Yes, I remember when this was 'while (0<(len = read(0, buf,
sizeof(buf)))) { write(1, buf, len); dprintf(2, "%llu\n", total += len); }' or
possibly yes.c I remember...