[Toybox] buffer sizes

Wed Feb 28 15:02:22 PST 2024

On Wed, Feb 28, 2024 at 1:33 PM Rob Landley <rob at landley.net> wrote:
>
> On 2/28/24 13:14, enh via Toybox wrote:
> > just fyi if you don't follow the coreutils list,
>
> Sadly, I am still subscribed to that because:
>
>   https://lists.gnu.org/archive/html/coreutils/2023-08/msg00100.html
>
> STILL hasn't been addressed. (How many reminders is too many? Are they being
> passive aggressive or just disorganized?)
>
> Recently they've been arguing about hash functions in sort, in a feature (-R)
> I've never even used, let alone implemented. (And would hit with a cheap CRC64
> or something if I did.) I've been "reading with the d key" in that thread, as it
> were...
>
> > i see that they're
> > looking at moving up from 128KiB to 256KiB buffers (but without saying
> > how _much_ "more performance" that gets them, nor what exactly "modern
> > hardware" means).
>
> I remember when L3 cache was introduced in those Tyan boards in something like
> 2001. Sweet spots migrate.
>
> Going from "byte" to "block" is a big win, but how BIG the block is often has
> exponentially diminishing returns. A system call is a blocking round trip
> that adds latency to your process (which you never get back no matter how
> parallel the rest of the system is), plus a chance for the scheduler not to
> resume you immediately, and a bigger buffer amortizes that cost over more data.
>
> 1 byte to 128 bytes saves you 7 doublings in the number of system calls (round
> trips). Going the rest of the way from 128 bytes to 4k is only 5 doublings, less
> of a win than even that small initial buffer/batching. And going from 4k to 128k
> is again 5 doublings, so _maybe_ another 1/3 gain assuming that's your bottleneck.
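
(For anyone who wants to check the doubling arithmetic, here is a minimal
sketch. The helper names doublings() and calls_needed() are made up for
illustration and are not toybox functions. It only counts round trips and
says nothing about how long each one takes.)

```c
#include <stdint.h>

// How many times "from" has to double to reach "to".
// Assumes both are powers of two and from is nonzero.
int doublings(uint64_t from, uint64_t to)
{
  int n = 0;

  while (from < to) {
    from <<= 1;
    n++;
  }

  return n;
}

// Number of read() round trips needed to move "total" bytes through a
// buffer of "bufsize" bytes, rounding up for a final partial block.
uint64_t calls_needed(uint64_t total, uint64_t bufsize)
{
  return (total + bufsize - 1) / bufsize;
}
```

(Moving 1 MiB takes calls_needed(1<<20, 4096) round trips with a 4k buffer
versus calls_needed(1<<20, 131072) with a 128k one, so each doubling of the
buffer halves the count but removes fewer calls in absolute terms.)
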
>
> > (don't get me wrong --- this is definitely a tricky one. bionic and
> > musl chose smaller-than-traditional values for BUFSIZ for a reason,
> > and although there's a question of whether that applies to a small
> > stand-alone tool like toybox, i'm unconvinced that "one size fits all
> > for toybox" either.
>
> The reason arm64 switched from 4k pages to 64k pages wasn't performance, it was
> a hack to get a bigger physical memory address range without increasing the
> number of page table levels. Moving from 4k to 64k pages let them go from 48 to
> 52 bits of physical memory. (And PISSED OFF the musl maintainer...)

(not sure how we got onto this, but 16KiB page sizes for arm64 are
very much about performance ... apple isn't using 16KiB pages on iOS
to support larger amounts of physical memory :-) )

> Meanwhile, Intel implemented 5 level page tables instead, and did 57 bits
> because as https://www.youtube.com/watch?v=va6nPu-1auE explained, it's "1 more
> than you". ("So there", presumably.)
>
> > there's a world of difference between your minimal
> > mmu-less targets and even the lowest-spec'ed Android Go phone you
> > could ship today, and another couple of orders of magnitude between
> > that and a flagship Android phone. let alone the build servers
> > building AOSP :-) )
>
> Indeed. I'm trying to 80/20 the entire range from wind-up toys to IBM's z-series
> monsters with 4 terabytes per processor (and then you start singing "NUMA NUMA"
> with the question of whether it's to "Louie Louie" or to that muppet song from
> https://www.nbcnews.com/pop-culture/pop-culture-news/mahna-mahna-came-porn-film-flna6c9593504
> being an ongoing subject of debate among scholars.)
>
> A for loop reading data into a buffer and copying it out again is not ideal
> regardless of buffer size. I use splice() ala sendfile() where possible, I've
> worked out an example of writev() so that if there's a second user I can maybe
> genericize that into lib/, and you'll notice count_main() has an xmalloc(65536)
> instead of using toybuf[4096] because yeah, larger block size for that I/O loop
> made sense. (Especially since each run through the loop has FOUR system calls:
> check the time, poll, read the data, check the time again.)
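
(To make the "for loop reading data into a buffer and copying it out again"
concrete, here is a rough sketch of that pattern with a caller-chosen block
size. copy_fd() is a hypothetical name, not the code in toybox, and it leaves
out the time checks and poll() that the count loop does.)

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

// Copy everything from fd "in" to fd "out" through a heap buffer of
// "bufsize" bytes. Returns the number of bytes copied, or -1 on error.
long long copy_fd(int in, int out, size_t bufsize)
{
  char *buf = malloc(bufsize);
  long long total = 0;
  ssize_t len;

  if (!buf) return -1;
  for (;;) {
    len = read(in, buf, bufsize);
    if (len < 0) {
      if (errno == EINTR) continue;
      total = -1;
      break;
    }
    if (!len) break;  // end of input

    // write() may accept fewer bytes than asked, so loop until the
    // whole block has gone out.
    for (ssize_t off = 0; off < len; ) {
      ssize_t n = write(out, buf + off, len - off);

      if (n < 0) {
        if (errno == EINTR) continue;
        free(buf);
        return -1;
      }
      off += n;
    }
    total += len;
  }
  free(buf);

  return total;
}
```

(Calling copy_fd(0, 1, 4096) versus copy_fd(0, 1, 65536) under strace -c is
a quick way to see how the system call count changes with block size on a
given machine.)
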
>
> The setvbuf(4096) I recently added to main wasn't about performance (you don't
> printf() or fwrite() to stdout if you need speed, using FILE * is going to copy
> the data twice at the best of times). It's about trying to keep screen flicker
> due to partial screen updates in things like "top" down to a dull roar. (If the
> system schedules you away after the write() system call, you can get an
> arbitrary latency spike halfway through a screen update no matter how efficient
> your code thinks it is.) The size is because 80*25 is 2k, arbitrarily double it,
> with a note that most screens aren't 100% full: seemed like a reasonable amount
> of slack.
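
(A sketch of the sizing logic described above, using the standard setvbuf()
call. screen_buf_size() and full_buffer() are illustrative names, and the
"double one screenful" rule is just the heuristic from the paragraph above,
not a measured value.)

```c
#include <stdio.h>

// One byte per character cell, then doubled for slack (escape sequences,
// multibyte characters). 80x25 gives 4000, which 4096 covers.
size_t screen_buf_size(unsigned cols, unsigned rows)
{
  return 2 * (size_t)cols * rows;
}

// Switch a stream to full buffering with a libc-allocated buffer of
// "size" bytes, so a screen update can go out in one write() rather than
// several. Call it before doing any other I/O on the stream.
// Returns 0 on success, nonzero on failure (same as setvbuf).
int full_buffer(FILE *fp, size_t size)
{
  return setvbuf(fp, NULL, _IOFBF, size);
}
```

(With full buffering, an interactive command has to fflush(stdout) at the
end of each screen update, otherwise output sits in the buffer until it
fills.)
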
>
> I can see arguing that you might need 8k there to avoid screen flicker with the
> large terminal sizes some people tend to use. Or if you're doing a lot of
> unicode (nobody said the user names and command lines top displays WEREN'T unicode).
>
> I admit "large unconditional contiguous allocation in main" starts to get into
> 4096<<(2*CFG_TOYBOX_FORK) territory, but I'd probably be ok with that. I want
> consistent behavior from commands, but "different flicker thresholds on output"
> I can live with. :)
>
> IF that's really an issue...?
>
> Changing the size of toybuf[] seems intrusive. I'd want to audit every command,
> and I don't really want to audit every command.
>
> Rob