[Toybox] buffer sizes

Wed Feb 28 13:41:06 PST 2024

On 2/28/24 13:14, enh via Toybox wrote:
> just fyi if you don't follow the coreutils list,

Sadly, I am still subscribed to that because:

  https://lists.gnu.org/archive/html/coreutils/2023-08/msg00100.html

STILL hasn't been addressed. (How many reminders is too many? Are they being
passive aggressive or just disorganized?)

Recently they've been arguing about hash functions in sort, in a feature (-R)
I've never even used, let alone implemented. (And would hit with a cheap CRC64
or something if I did.) I've been "reading with the d key" in that thread, as it
were...

> i see that they're
> looking at moving up from 128KiB to 256KiB buffers (but without saying
> how _much_ "more performance" that gets them, nor what exactly "modern
> hardware" means).

I remember when L3 cache was introduced in those Tyan boards in something like
2001. Sweet spots migrate.

Going from "byte" to "block" is a big win. But how BIG the block is has sharply
diminishing returns: each doubling only halves the system calls you still have
left. A system call is a blocking round trip introducing latency into your
process (which you never get back no matter how parallel the rest of the system
is), with the chance that the scheduler won't immediately resume you and so on
basically amortized in.

1 byte to 128 bytes saves you 7 doublings in the number of system calls (round
trips). Going the rest of the way from 128 bytes to 4k is only 5 doublings, less
of a win than even that small initial buffer/batching. And going from 4k to 128k
is again 5 doublings, so _maybe_ another 1/3 gain assuming that's your bottleneck.
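
To make the doubling arithmetic concrete, here is a small sketch (not from
toybox; the function names and the 1 MiB transfer size are made up for
illustration) that counts doublings between buffer sizes and the system calls
needed to move a fixed amount of data at each size:

```c
#include <assert.h>
#include <stdio.h>

// How many times "from" must double to reach "to" (powers of two assumed).
int doublings(unsigned long from, unsigned long to)
{
  int n = 0;

  assert(from && to >= from);
  while (from < to) {
    from <<= 1;
    n++;
  }

  return n;
}

// System calls needed to move "total" bytes in "bufsize" chunks, rounded up.
unsigned long syscalls(unsigned long total, unsigned long bufsize)
{
  assert(bufsize);

  return (total + bufsize - 1) / bufsize;
}

// Print one line per buffer size for a hypothetical 1 MiB transfer.
void show_table(void)
{
  unsigned long sizes[] = {1, 128, 4096, 131072, 262144};
  unsigned i;

  for (i = 0; i < sizeof(sizes)/sizeof(*sizes); i++)
    printf("%7lu byte buffer: %8lu calls, %2d doublings from 1 byte\n",
      sizes[i], syscalls(1UL<<20, sizes[i]), doublings(1, sizes[i]));
}
```

Calling show_table() from main() prints the table. The absolute number of calls
saved by each step shrinks fast, which is the diminishing returns part.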

> (don't get me wrong --- this is definitely a tricky one. bionic and
> musl chose smaller-than-traditional values for BUFSIZ for a reason,
> and although there's a question of whether that applies to a small
> stand-alone tool like toybox, i'm unconvinced that "one size fits all
> for toybox" either.

The reason arm64 switched from 4k pages to 64k pages wasn't performance, it was
a hack to get a bigger physical memory address range without increasing the
number of page table levels. Moving from 4k to 64k pages let them go from 48 to
52 bits of physical memory. (And PISSED OFF the musl maintainer...)

Meanwhile, Intel implemented 5 level page tables instead, and did 57 bits (of virtual address)
because as https://www.youtube.com/watch?v=va6nPu-1auE explained, it's "1 more
than you". ("So there", presumably.)

> there's a world of difference between your minimal
> mmu-less targets and even the lowest-spec'ed Android Go phone you
> could ship today, and another couple of orders of magnitude between
> that and a flagship Android phone. let alone the build servers
> building AOSP :-) )

Indeed. I'm trying to 80/20 the entire range from wind-up toys to IBM's z-series
monsters with 4 terabytes per processor (and then you start singing "NUMA NUMA"
with the question of whether it's to "Louie Louie" or to that muppet song from
https://www.nbcnews.com/pop-culture/pop-culture-news/mahna-mahna-came-porn-film-flna6c9593504
being an ongoing subject of debate among scholars.)

A for loop reading data into a buffer and copying it out again is not ideal
regardless of buffer size. I use splice() a la sendfile() where possible, I've
worked out an example of writev() so that if there's a second user I can maybe
genericize that into lib/, and you'll notice count_main() has an xmalloc(65536)
instead of using toybuf[4096] because yeah, larger block size for that I/O loop
made sense. (Especially since each run through the loop has FOUR system calls:
check the time, poll, read the data, check the time again.)
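
A minimal sketch of the "let the kernel do the copy where possible" idea,
assuming Linux: try sendfile() first, and drop back to a read/write loop with a
caller-chosen block size when the file descriptors don't support it. The name
copy_fd() and its signature are hypothetical, this is not toybox's lib/ code,
and it uses sendfile() rather than splice() to keep the example short:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <unistd.h>

// Hypothetical helper: copy from in to out until EOF.
// Returns the number of bytes copied, or -1 on error.
long long copy_fd(int in, int out, size_t blocksize)
{
  long long total = 0;
  ssize_t len, wlen, off;
  char *buf;

  assert(blocksize);

  // Fast path: the kernel moves the data, no bounce through userspace.
  // With a NULL offset sendfile() advances in's file position itself.
  for (;;) {
    len = sendfile(out, in, NULL, 1<<20);
    if (len > 0) total += len;
    else if (!len) return total;
    else if (errno == EINVAL || errno == ENOSYS) break;
    else if (errno != EINTR) return -1;
  }

  // Fallback: plain loop, picking up wherever the fast path stopped.
  if (!(buf = malloc(blocksize))) return -1;
  for (;;) {
    len = read(in, buf, blocksize);
    if (len < 0 && errno == EINTR) continue;
    if (len < 1) break;
    // write() can be short, so loop until the whole block is out.
    for (off = 0; off < len; off += wlen) {
      wlen = write(out, buf+off, len-off);
      if (wlen < 0) {
        if (errno != EINTR) {
          free(buf);
          return -1;
        }
        wlen = 0;
      }
    }
    total += len;
  }
  free(buf);

  return len < 0 ? -1 : total;
}
```

Something like copy_fd(0, 1, 65536) would copy stdin to stdout, only touching
the 64k buffer when the fast path isn't available.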

The setvbuf(4096) I recently added to main wasn't about performance (you don't
printf() or fwrite() to stdout if you need speed, using FILE * is going to copy
the data twice at the best of times). It's about trying to keep screen flicker
due to partial screen updates in things like "top" down to a dull roar. (If the
system schedules you away after the write() system call, you can get an
arbitrary latency spike halfway through a screen update no matter how efficient
your code thinks it is.) The size is because 80*25 is 2k, arbitrarily double it,
with a note that most screens aren't 100% full: seemed like a reasonable amount
of slack.
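
As a rough sketch of the flicker argument (the exact setvbuf() arguments in
toybox's main() aren't shown in this message, so the explicit static buffer and
the helper names here are assumptions): with full buffering, a whole screen
update piles up in the stdio buffer and goes out in one write() at fflush()
time, as long as it fits in the buffer.

```c
#include <assert.h>
#include <stdio.h>

// 80 columns * 25 rows is about 2k, doubled for slack.
static char screenbuf[4096];

// Switch fp to full buffering using screenbuf. Must be called before any
// other I/O on the stream. Returns 0 on success like setvbuf() does.
int buffer_output(FILE *fp)
{
  assert(fp);

  return setvbuf(fp, screenbuf, _IOFBF, sizeof(screenbuf));
}

// Draw one frame: every call below only appends to the stdio buffer,
// and the single fflush() at the end hands the result to write().
void draw_frame(FILE *fp, int rows)
{
  int i;

  fputs("\033[H", fp);                     // cursor to top left
  for (i = 0; i < rows; i++)
    fprintf(fp, "row %d\033[K\r\n", i);    // text, then erase to end of line
  fflush(fp);
}
```

A frame bigger than the buffer still gets split across multiple write() calls,
which is where the argument for a larger size on big terminals comes from.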

I can see arguing that you might need 8k there to avoid screen flicker with the
large terminal sizes some people tend to use. Or if you're doing a lot of
unicode (nobody said the user names and command lines top displays WEREN'T unicode).

I admit "large unconditional contiguous allocation in main" starts to get into
4096<<(2*CFG_TOYBOX_FORK) territory, but I'd probably be ok with that. I want
consistent behavior from commands, but "different flicker thresholds on output"
I can live with. :)
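
For reference, the expression works out to two possible sizes. A sketch,
assuming CFG_TOYBOX_FORK is the usual 0-or-1 config symbol (it is defined by
hand here only so the example stands alone, and output_bufsize() is a
hypothetical name):

```c
#include <assert.h>
#include <stddef.h>

// Stand-in for toybox's generated config symbol: 1 when fork() is usable
// (MMU system), 0 on nommu. Normally this comes from the build, not from here.
#ifndef CFG_TOYBOX_FORK
#define CFG_TOYBOX_FORK 1
#endif

// 4096<<0 is 4096 without fork, 4096<<2 is 16384 with it.
#define OUTPUT_BUFSIZE (4096<<(2*CFG_TOYBOX_FORK))

static_assert(OUTPUT_BUFSIZE == 4096 || OUTPUT_BUFSIZE == 16384,
  "CFG_TOYBOX_FORK should be 0 or 1");

// Same computation at runtime, to show both cases side by side.
size_t output_bufsize(int have_fork)
{
  return (size_t)4096<<(2*!!have_fork);
}
```

So nommu targets would keep the 4k allocation and everything else would get 16k.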

IF that's really an issue...?

Changing the size of toybuf[] seems intrusive. I'd want to audit every command,
and I don't really want to audit every command.

Rob