<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Apr 14, 2023 at 1:49 PM Rob Landley <<a href="mailto:rob@landley.net" target="_blank">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 4/11/23 19:00, enh wrote:<br>
> On Tue, Apr 11, 2023 at 12:10 PM Rob Landley<br>
> > (unrelated, i've been meaning to ask whether we should make toybuf larger<br>
> > anyway. 4KiB is really small for modern hardware, though at the same time<br>
> > it does make it more likely that we test all the "toybuf too small, loop"<br>
> > cases even with small test inputs...)<br>
> <br>
> A) not a fan of asserts.<br>
> <br>
> i don't like assert(), but **static_assert** is really useful for things like<br>
> this where you want to say "this code makes an assumption that you can test at<br>
> compile time".<br>
<br>
Compile time is the time to care about that sort of thing, yes.<br>
<br>
Kinda wonder about portability on weirdness like qnx (or the guy in email who's<br>
asking about uclibc-ng). I suspect any such asserts would be a CFG_DEBUG option<br>
maybe in portability.c? Hmmm...<br></blockquote><div><br></div><div>they haven't made other complaints about C11, so presumably static_assert is fine too?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> B) it was only ever coincidentally page size, and huge pages are a thing even on<br>
> x86.<br>
> <br>
> <br>
> well, huge pages are different from non-4KiB non-huge pages.<br>
<br>
There was a lot of talk a while back about getting the kernel to dynamically use<br>
them (false starts when I was reading about it), but I don't follow <a href="http://lwn.net" rel="noreferrer" target="_blank">lwn.net</a> or<br>
lkml nearly as closely in 2023 as I did in 2018. It just got too unpleasant even<br>
to check over the weekly web archive...<br></blockquote><div><br></div><div>_transparent_ huge pages is yet another different thing :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> i think it's only<br>
> arm64 where you're at all likely to actually have your page size not be 4KiB.<br>
> (all macs and iphones, for example. i _think_ all the linux distros that tried<br>
> to move gave up?)<br>
<br>
What _is_ Mel Gorman up to these days? He last updated his blog in 2016, and<br>
last tweeted in 2020...<br>
<br>
> I never annotated toybuf or libbuf with any sort of alignment directive or tried<br>
> to make it come first in its segment (toybuf and libbuf are the fifth and sixth<br>
> globals defined in main.c), so they're both reasonably likely to straddle page<br>
> boundaries anyway. Heck, I'm not even sure it's cache line aligned. The actual<br>
> _guarantee_ is something like 4 bytes, except when it suddenly isn't. I fought<br>
> with this in 2021 trying to get a simple "hello world" kernel out of gcc without<br>
> needing a linker script: <a href="https://landley.net/notes-2021.html#12-04-2021" rel="noreferrer" target="_blank">https://landley.net/notes-2021.html#12-04-2021</a><br>
> <<a href="https://landley.net/notes-2021.html#12-04-2021" rel="noreferrer" target="_blank">https://landley.net/notes-2021.html#12-04-2021</a>><br>
> <br>
> now you're on C11, you can easily say this:<br>
> <a href="https://en.cppreference.com/w/c/language/_Alignas" rel="noreferrer" target="_blank">https://en.cppreference.com/w/c/language/_Alignas</a><br>
<br>
Hmmm... Does it actually help to page align them, do you think? Not sure how to<br>
benchmark that...<br></blockquote><div><br></div><div>(i was half expecting it to be page aligned for free, as the joint-second-largest OBJECT in toybox, but readelf says it isn't, and neither is `this`.)</div><div><br></div><div>i don't have good benchmarks either, or i'd have given you data. the build team does keep track of where all the time goes in builds, but (other than that pathological case we had in sed) toybox is unlikely to make their hit list.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> The 4096 is just a convenient scratch pad size. I use sizeof(toybuf) in a bunch<br>
> of places... and hardwire in the knowledge of its size in a bunch of others.<br>
> Plus there's a bunch of implicit "toybuf and/or this slice of it is big enough<br>
> to stick this struct in, so I can safely typecast the pointer" instances I<br>
> checked at the time (and all of them had a big fudge factor in case of future<br>
> glibc bloat).<br>
> <br>
> It's really a "convenient granularity" thing. Copy loops doing byte-at-a-time<br>
> stuff is known terrible because the library and syscall execution paths come to<br>
> dominate, and grouping it into 4k blocks is 12 doublings of efficiency right<br>
> there. Going to 64k is 1/16th as much syscalls, which is not as big a deal as<br>
> 1/4000th as many syscalls. And then raises the question "why not a megabyte<br>
> then" which is something you don't just casually want to do on embedded devices<br>
> without thinking about it (might as well malloc there)...<br>
> <br>
> I could probably be talked into bumping it up to 64k if somebody measured<br>
> numbers saying it would help something specific? <br>
> <br>
> <br>
> i think the time i noticed this was when i was looking into "where the time<br>
> went" and noticed that a 64KiB buffer was quite helpful, at least on the scale<br>
> of "an entire Android build" type of thing.<br>
<br>
Which counts as a good reason to increase it, but is that "bigger toybuf helps<br>
in general" or "some copy loops should use a malloc buffer instead of toybuf"?<br></blockquote><div><br></div><div>(or, as you come to later, "some copy loops should just sendfile() instead".)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
My todo items here tend to be about limitations, things like in ps.c get_ps()<br>
puts struct procpid in toybuf so the strings it's reading are cumulatively<br>
finite (seem the comment around line 878) which mostly impacts cmdline, but with<br>
modern kernel changes that's 10 megabytes per entry so it's probably gotta have<br>
SOME sort of limit it's reading. :)<br>
<br>
That case does a deferred malloc on ps.c line 1000 to copy the result out of<br>
toybuf, so in theory the initial pointer could be replaced by an xmalloc(65536)<br>
and then the current malloc would become a remalloc() to trim it down to what's<br>
actually used. The size calculations are mostly sizeof(toybuf) so could swap to<br>
the new size.<br>
<br>
Well, one exception: the 2048 on line 747 is sort of "half of toybuf" but it's<br>
really just that /proc/$PID/stat is never gonna be longer than this because wc<br>
/proc/$$/stat says 52 entries, one of which is the command name (which has to be<br>
a filename which means it's limited by the VFS maximum of 255 bytes plus the<br>
enclosing parentheses for 257 which I rounded to 260 because null terminator and<br>
4 byte alignment) and then all the OTHER entries are 64 bit numbers printed out<br>
in decimal so 21 bytes each with the space between them, 52*21+260 is 1382, and<br>
I gave it room for future expansion because kernel guys append stuff. (Not that<br>
we'd parse the result if they did, because the for loop on line 764 is<br>
traversing from SLOT_ppid to SLOT_upticks so our iteration count is based on the<br>
array not the input we read.)<br></blockquote><div><br></div><div>(i think for ps, the interesting question is "why does toybox top take 10% cpu?". and although i don't know the answer to that, i don't think that's toybuf's fault :-) )</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> is it _worth_ it? don't know. what's the _optimal_ size? don't know. (and<br>
> probably depends on the specific toy, and 4096 is clearly a sensible _lower_<br>
> bound...)<br>
<br>
Optimization is generally about a use case, you never know if you've actually<br>
HELPED until you can benchmark the result. I can see hardware having moved out<br>
from under us in the past 15 years, with block granularity switched to 4k,<br>
sending 4k pages that aren't guaranteed to be aligned could fairly regularly<br>
have read/edit/write cycles on two blocks it's straddling, although the page<br>
cache and I/O scheduler should really hide all that. (The cpu's also gonna be<br>
doing that under the covers but the _cpu_cache_ should hide that. It's really<br>
operating at cache line granularity and we should be way above that? I'd think<br>
burst read/write should be on the far side of L2 for any system you care about?)<br>
<br>
Last I checked the "new" (like, 10 years now) kernel pipe buffers are 128k and<br>
there are at most 32 of them? So if you're sending between processes maybe that<br>
would come up, but again... need more info. :)<br>
<br>
The place I'd think it would come up most would be sendfile, which is using a<br>
syscall these days and only falling BACK to a loop over libbuf...<br>
<br>
Oh, hang on:<br>
<br>
if (bytes<0 || bytes>(1<<30)) len = (1<<30);<br>
// glibc added this constant in git at the end of 2017, shipped 2018-02.<br>
// Android's had the constant for years, but you'll get SIGSYS if you use<br>
// this system call before Android U (2023's release).<br>
#if defined(__NR_copy_file_range) && !defined(__ANDROID__)<br>
len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);<br>
#else<br>
<br>
That might be worth looking into...<br></blockquote><div><br></div><div>well, the _build_ performance is all on the host. (though until we switch from host glibc to host musl, we probably don't have __NR_copy_file_range defined anyway because of our ancient glibc.)</div><div><br></div><div>as for the device, we're caught between caring about the current OS version and wanting the static binary to be able to run on old OS versions. we can trivially remove this, but it will break any calling code for pre-U (which is literally "everything currently in anyone's hands"). right now we don't have any known users of the _device_ toybox static prebuilt, but changing this would likely preclude that.</div><div><br></div><div>if we want to go this route, we'll probably have to go with something like a signal handler for SIGSYS and a boolean. or a weak declaration of the copy_file_range() *function*? no, because that's useless for the static linking we're concerned about.</div><div><br></div><div>oh, wait... we should just be able to use android_get_device_api_level()...</div><div><br></div><div>i think something like this should work?</div><div>```</div><div>diff --git a/lib/portability.c b/lib/portability.c<br>index a9f28154..b205d7b7 100644<br>--- a/lib/portability.c<br>+++ b/lib/portability.c<br>@@ -626,10 +626,19 @@ long long sendfile_len(int in, int out, long long bytes, long long *consumed)<br> if (try_cfr) {<br> if (bytes<0 || bytes>(1<<30)) len = (1<<30);<br> // glibc added this constant in git at the end of 2017, shipped 2018-02.<br>- // Android's had the constant for years, but you'll get SIGSYS if you use<br>+#if defined(__NR_copy_file_range)<br>+ #if defined(__ANDROID__)<br>+ // Android's had the *constant* for years, but you'll get SIGSYS if you use<br> // this system call before Android U (2023's release).<br>-#if defined(__NR_copy_file_range) && !defined(__ANDROID__)<br>+ if (android_get_device_api_level() >= __ANDROID_API_U__) {<br>+ len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);<br>+ } else {<br>+ errno = EINVAL;<br>+ len = -1;<br>+ }<br>+ #else<br> len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);<br>+ #endif<br> #else<br> errno = EINVAL;<br> len = -1;<br></div><div>```</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Rob<br>
</blockquote></div></div>