[Toybox] sizeof(toybuf)

Fri Apr 14 14:04:10 PDT 2023

On 4/11/23 19:00, enh wrote:
> On Tue, Apr 11, 2023 at 12:10 PM Rob Landley
>     > (unrelated, i've been meaning to ask whether we should make toybuf larger
>     > anyway. 4KiB is really small for modern hardware, though at the same time
>     > it does make it more likely that we test all the "toybuf too small, loop"
>     > cases even with small test inputs...)
> 
>     A) not a fan of asserts.
> 
> i don't like assert(), but **static_assert** is really useful for things like
> this where you want to say "this code makes an assumption that you can test at
> compile time".

Compile time is the time to care about that sort of thing, yes.

Kinda wonder about portability on weirdness like qnx (or the guy in email who's
asking about uclibc-ng). I suspect any such asserts would be a CFG_DEBUG option
maybe in portability.c? Hmmm...
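
For reference, a compile-time check of the "this struct fits in toybuf" kind might look roughly like the sketch below. The struct and helper names are made up for illustration, and the buffer here is a stand-in for the real global in main.c:

```c
#include <assert.h>   // C11: provides the static_assert macro
#include <stddef.h>

// Stand-in for the 4k global scratch buffer (assumption: same size as today).
char toybuf[4096];

// Hypothetical struct that some command wants to park in toybuf.
struct scratch_record {
  long long slot[64];
  char name[260];
};

// Breaks the build, not the run, if the struct ever outgrows the buffer.
static_assert(sizeof(struct scratch_record) <= sizeof(toybuf),
              "scratch_record no longer fits in toybuf");

// Bytes left over after the struct: the "fudge factor" for future growth.
size_t scratch_slack(void)
{
  return sizeof(toybuf) - sizeof(struct scratch_record);
}
```

Nothing here costs anything at runtime, so whether it lives behind a debug option is purely a portability question.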

>     B) it was only ever coincidentally page size, and huge pages are a thing even on
>     x86.
> 
> 
> well, huge pages are different from non-4KiB non-huge pages.

There was a lot of talk a while back about getting the kernel to dynamically use
them (false starts when I was reading about it), but I don't follow lwn.net or
lkml nearly as closely in 2023 as I did in 2018. It just got too unpleasant even
to check over the weekly web archive...

> i think it's only
> arm64 where you're at all likely to actually have your page size not be 4KiB.
> (all macs and iphones, for example. i _think_ all the linux distros that tried
> to move gave up?)

What _is_ Mel Gorman up to these days? He last updated his blog in 2016, and
last tweeted in 2020...

>     I never annotated toybuf or libbuf with any sort of alignment directive or tried
>     to make it come first in its segment (toybuf and libbuf are the fifth and sixth
>     globals defined in main.c), so they're both reasonably likely to straddle page
>     boundaries anyway. Heck, I'm not even sure it's cache line aligned. The actual
>     _guarantee_ is something like 4 bytes, except when it suddenly isn't. I fought
>     with this in 2021 trying to get a simple "hello world" kernel out of gcc without
>     needing a linker script: https://landley.net/notes-2021.html#12-04-2021
> 
> now you're on C11, you can easily say this:
> https://en.cppreference.com/w/c/language/_Alignas

Hmmm... Does it actually help to page align them, do you think? Not sure how to
benchmark that...
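
A minimal sketch of what the C11 annotation would look like, assuming a 4096 byte page and a toolchain that honors that much alignment for static objects (gcc and clang on ELF targets generally do). The buffer and helper names are hypothetical, and this says nothing about whether the alignment buys any speed:

```c
#include <assert.h>
#include <stdalign.h>  // C11: alignas
#include <stdint.h>

#define SKETCH_PAGE 4096  // assumed page size, not queried from the system

static_assert((SKETCH_PAGE & (SKETCH_PAGE - 1)) == 0,
              "alignment must be a power of two");

// Hypothetical page-aligned scratch buffer.
alignas(SKETCH_PAGE) char aligned_buf[4096];

// Returns 1 if p sits on a SKETCH_PAGE boundary, 0 otherwise.
int is_page_aligned(const void *p)
{
  return ((uintptr_t)p % SKETCH_PAGE) == 0;
}
```

With the annotation, is_page_aligned(aligned_buf) holds by construction. Without it, the answer depends on where the linker happens to place the symbol.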

>     The 4096 is just a convenient scratch pad size. I use sizeof(toybuf) in a bunch
>     of places... and hardwire in the knowledge of its size in a bunch of others.
>     Plus there's a bunch of implicit "toybuf and/or this slice of it is big enough
>     to stick this struct in, so I can safely typecast the pointer" instances I
>     checked at the time (and all of them had a big fudge factor in case of future
>     glibc bloat).
> 
>     It's really a "convenient granularity" thing. Copy loops doing byte-at-a-time
>     stuff is known terrible because the library and syscall execution paths come to
>     dominate, and grouping it into 4k blocks is 12 doublings of efficiency right
>     there. Going to 64k is 1/16th as many syscalls, which is not as big a deal as
>     1/4000th as many syscalls. And then raises the question "why not a megabyte
>     then" which is something you don't just casually want to do on embedded devices
>     without thinking about it (might as well malloc there)...
> 
>     I could probably be talked into bumping it up to 64k if somebody measured
>     numbers saying it would help something specific? 
> 
> 
> i think the time i noticed this was when i was looking into "where the time
> went" and noticed that a 64KiB buffer was quite helpful, at least on the scale
> of "an entire Android build" type of thing.

Which counts as a good reason to increase it, but is that "bigger toybuf helps
in general" or "some copy loops should use a malloc buffer instead of toybuf"?
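
To put rough numbers on the granularity argument quoted above, here is a small sketch that counts how many read() calls a copy loop needs for a given input and buffer size. The function name is made up, and it assumes every read fills the buffer, which real reads need not do:

```c
#include <assert.h>

// The two ratios from the discussion, checked at compile time.
static_assert((1 << 12) == 4096, "4k is 12 doublings up from one byte");
static_assert(65536 / 4096 == 16, "64k needs 1/16th the calls that 4k does");

// Number of full-buffer reads needed to move `bytes` through a buffer of
// `bufsize` bytes, rounding up. Returns 0 for a zero-sized buffer.
unsigned long long calls_needed(unsigned long long bytes,
                                unsigned long long bufsize)
{
  if (!bufsize) return 0;
  return bytes / bufsize + (bytes % bufsize != 0);
}
```

For a 1 MiB input that is 1048576 calls a byte at a time, 256 calls at 4k, and 16 calls at 64k, so the first step removes almost all of the per-call overhead and the second removes a sliver of what is left.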

My todo items here tend to be about limitations, things like in ps.c get_ps()
puts struct procpid in toybuf so the strings it's reading are cumulatively
finite (see the comment around line 878) which mostly impacts cmdline, but with
modern kernel changes that's 10 megabytes per entry so it's probably gotta have
SOME sort of limit it's reading. :)

That case does a deferred malloc on ps.c line 1000 to copy the result out of
toybuf, so in theory the initial pointer could be replaced by an xmalloc(65536)
and then the current malloc would become a realloc() to trim it down to what's
actually used. The size calculations are mostly sizeof(toybuf) so could swap to
the new size.

Well, one exception: the 2048 on line 747 is sort of "half of toybuf" but it's
really just that /proc/$PID/stat is never gonna be longer than this because wc
/proc/$$/stat says 52 entries, one of which is the command name (which has to be
a filename which means it's limited by the VFS maximum of 255 bytes plus the
enclosing parentheses for 257 which I rounded to 260 because null terminator and
4 byte alignment) and then all the OTHER entries are 64 bit numbers printed out
in decimal so 21 bytes each with the space between them, 52*21+260 is 1352, and
I gave it room for future expansion because kernel guys append stuff. (Not that
we'd parse the result if they did, because the for loop on line 764 is
traversing from SLOT_ppid to SLOT_upticks so our iteration count is based on the
array not the input we read.)
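
The worst-case estimate above is the kind of thing a compile-time check can pin down. A sketch, using the figures from the paragraph (52 fields, 21 bytes per number, 260 for the name, a 2048 byte budget); the constant and function names are invented for illustration and are not what ps.c uses:

```c
#include <assert.h>

enum {
  STAT_FIELDS = 52,   // entry count reported by wc on /proc/$$/stat
  FIELD_BYTES = 21,   // up to 20 decimal digits of a 64 bit value, plus a space
  NAME_BYTES  = 260,  // 255 byte name plus parentheses, rounded up
  STAT_BUDGET = 2048  // the "half of toybuf" read limit
};

// Counts all 52 fields as numbers, which slightly overestimates.
static_assert(STAT_FIELDS * FIELD_BYTES + NAME_BYTES <= STAT_BUDGET,
              "worst-case stat line no longer fits in the read budget");

// Worst-case length of one stat line under the assumptions above.
int stat_worst_case(void)
{
  return STAT_FIELDS * FIELD_BYTES + NAME_BYTES;
}

// Room left for fields the kernel might append later.
int stat_headroom(void)
{
  return STAT_BUDGET - stat_worst_case();
}
```

The estimate comes out a bit under 1.4k, leaving roughly a third of the budget as headroom.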

> is it _worth_ it? don't know. what's the _optimal_ size? don't know. (and
> probably depends on the specific toy, and 4096 is clearly a sensible _lower_
> bound...)

Optimization is generally about a use case: you never know if you've actually
HELPED until you can benchmark the result. I can see hardware having moved out
from under us in the past 15 years. With block granularity switched to 4k,
sending 4k pages that aren't guaranteed to be aligned could fairly regularly
cause read/edit/write cycles on the two blocks each one straddles, although the
page cache and I/O scheduler should really hide all that. (The cpu's also gonna be
doing that under the covers but the _cpu_cache_ should hide that. It's really
operating at cache line granularity and we should be way above that? I'd think
burst read/write should be on the far side of L2 for any system you care about?)

Last I checked the "new" (like, 10 years now) kernel pipe buffers are 128k and
there are at most 32 of them? So if you're sending between processes maybe that
would come up, but again... need more info. :)

The place I'd think it would come up most would be sendfile, which is using a
syscall these days and only falling BACK to a loop over libbuf...

Oh, hang on:

      if (bytes<0 || bytes>(1<<30)) len = (1<<30);
      // glibc added this constant in git at the end of 2017, shipped 2018-02.
      // Android's had the constant for years, but you'll get SIGSYS if you use
      // this system call before Android U (2023's release).
#if defined(__NR_copy_file_range) && !defined(__ANDROID__)
      len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);
#else

That might be worth looking into...
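
For anyone following along, the general shape of "try the syscall, fall back to a buffer loop" is roughly the sketch below. This is not the toybox code: copy_upto() and CHUNK are made-up names, any syscall failure simply drops through to the loop (which then reports real errors itself), and there is no EINTR handling. The Android guard is kept because SIGSYS cannot be caught by checking errno.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { CHUNK = 4096 };  // fallback buffer size, same as today's toybuf
static_assert(CHUNK <= (1 << 30), "chunk must not exceed the per-call cap");

// Copy up to len bytes from in to out. Returns bytes copied (possibly fewer
// than len), or -1 on error. Hypothetical helper, not a toybox function.
long long copy_upto(int in, int out, size_t len)
{
  char buf[CHUNK];
  long long total = 0;

#if defined(__NR_copy_file_range) && !defined(__ANDROID__)
  // Fast path: let the kernel move the data without a userspace buffer.
  long rc = syscall(__NR_copy_file_range, in, (void *)0, out, (void *)0,
                    len, 0u);
  if (rc >= 0) return rc;
  // Any failure (ENOSYS, EXDEV, EINVAL on pipes...) falls through.
#endif

  while (len) {
    size_t want = len < sizeof(buf) ? len : sizeof(buf);
    long got = read(in, buf, want), done = 0;

    if (got < 0) return -1;
    if (!got) break;  // EOF
    while (done < got) {  // write() may be partial, so loop until flushed
      long wrote = write(out, buf + done, got - done);
      if (wrote < 0) return -1;
      done += wrote;
    }
    total += got;
    len -= got;
  }
  return total;
}
```

The buffer size only matters on the fallback path, so on systems where the syscall works, a bigger toybuf or libbuf would not change this particular loop at all.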

Rob