<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 26, 2023 at 7:26 AM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 5/25/23 19:08, enh via Toybox wrote:<br>
> so i finally enabled copy_file_range for the _host_ toybox because someone<br>
> pointed out that we were copying 16GiB zip files around in the build, and even though<br>
> obviously we should stop doing that, 30s seemed unreasonable, and coreutils cp<br>
> "only" took 20s because of copy_file_range.<br>
<br>
Hardlinking them is not an option? :)</blockquote><div><br></div><div>yeah, that's the "obviously we should stop doing that" part. (we worry about pushback because the semantics aren't the same, though _personally_ i'd argue that the "it doesn't go stale if you do one part of the rebuild but not the whole thing" change is a _feature_ rather than a bug. but i don't yet know if anyone's _relying_ on the existing behavior.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> but toybox cp with copy_file_range still takes 25s. why?<br>
> <br>
> if (bytes<0 || bytes>(1<<30)) len = (1<<30);<br>
> <br>
> the checkin comment being:<br>
> <br>
> Update comments and add "sanity check" from kernel commit f16acc9d9b376.<br>
> (The kernel's been doing this since 2019, but older kernels may not, so...)<br>
<br>
The problem being that _before_ that commit, too big a sendfile didn't work<br>
right (returned an error from the kernel?). I suspect my range check was just<br>
the largest power of 2 that fit in the constraint...<br>
<br>
> what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is<br>
> actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB page<br>
> kernel?<br>
<br>
I don't think any of my test images have a PAGE_SHIFT other than 12? (Looks like<br>
Alpha, OpenRisc, and 64 bit Sparc are the only 3 architectures that CAN'T use a<br>
4k page size, and none of those are exactly load bearing these days.)<br></blockquote><div><br></div><div>(not relevant in this context, but darwin/arm64 is 16KiB. people do keep trying 64KiB linux/arm64, and one of these days they might succeed.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
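</blockquote><div><br></div><div>(for the curious, here's roughly what that cap works out to for different page sizes, assuming the kernel's MAX_RW_COUNT is still INT_MAX & PAGE_MASK, i.e. INT_MAX & ~(page_size-1). throwaway illustration, not anything from toybox:)</div><div><br></div><div>
#include <limits.h><br>
#include <stdio.h><br>
#include <unistd.h><br>
<br>
// print the kernel's per-call cap (INT_MAX & PAGE_MASK) for a few page<br>
// sizes, plus whatever page size the running kernel actually reports.<br>
int main(void)<br>
{<br>
  long i, sizes[] = {4096, 16384, 65536, sysconf(_SC_PAGE_SIZE)};<br>
<br>
  for (i = 0; i < 4; i++)<br>
    printf("page %6ld -> cap 0x%lx\n", sizes[i],<br>
           (unsigned long)(INT_MAX & ~(sizes[i]-1)));<br>
<br>
  return 0;<br>
}<br>
</div><div><br></div><div>(so it's a shade under 2GiB whatever the page size, 0x7ffff000 for 4KiB pages and 0x7fff0000 for 64KiB, which is roughly double the 1<<30 clamp either way.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">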
But I wouldn't have expected it to be that much slower given the block size here<br>
is a megabyte, and the number of transactions being submitted... 16 gigs done a<br>
megabyte at a time is 16k system calls, which is:<br>
<br>
$ cat hello2.c<br>
#include <stdio.h><br>
<br>
int main(int argc, char *argv[])<br>
{<br>
int i;<br>
<br>
// 16384 one-character writes = 16384 write() syscalls<br>
for (i = 0; i<16384; i++) dprintf(1, " ");<br>
}<br>
$ gcc hello2.c<br>
$ strace ./a.out 2>&1 | grep write | wc -l<br>
16384<br>
$ time ./a.out | wc<br>
0 0 16384<br>
<br>
real 0m0.033s<br>
user 0m0.012s<br>
sys 0m0.043s<br>
<br>
Halving the number of output system calls would theoretically save you around<br>
0.015 seconds on a 10 year old laptop.<br>
<br>
So why does it have a ~20% impact on the kernel's throughput? The kernel's cap<br>
isn't even cleanly a power of 2. Maybe the kernel is using 2 megabyte huge pages<br>
internally in the disk cache, and the smaller size is causing unnecessary<br>
copying? Is 1<<29 slower or faster than 1<<30? I didn't think letting something<br>
else get in there and seek was a big deal on ssd? Maybe a different hardware<br>
burst transaction size?<br>
<br>
This isn't even "maybe zerocopy from a userspace buffer above a certain size<br>
keeps the userspace process suspended so read and write never get to overlap"<br>
territory: there's no userspace buffer. This is "give the kernel two filehandles<br>
and a length and let it sort it out". We tell it what to do in very abstract<br>
terms. In theory the ENTIRE COPY OPERATION could be deferred by the filesystem,<br>
scheduling it as a big journal entry to update extents. On something like btrfs<br>
it could be shared extents behind the scenes. What is going ON here?<br></blockquote><div><br></div><div>excellent questions that should have occurred to me.</div><div><br></div><div>i _think_ what happened is that my VM got migrated to a machine with different performance. i'm now reliably 25s for everyone. (previously my coreutils testing had been on one day and my toybox on the next.)</div><div><br></div><div>so, yeah, changing toybox here makes no noticeable difference.</div><div><br></div><div>(on a related note, is there a clever way to make a 16GiB file without dealing with dd? i used truncate and then cp to de-sparse it, but i was surprised there wasn't a truncate flag for a non-sparse file. i guess it's pretty niche though?)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
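</blockquote><div><br></div><div>(partially answering my own question: i think posix_fallocate() gets there in one call, since it allocates the blocks up front instead of leaving a hole, and glibc falls back to writing the blocks on filesystems that can't do it natively. minimal sketch, with a made-up filename and a 64-bit off_t assumed:)</div><div><br></div><div>
#include <fcntl.h><br>
#include <stdio.h><br>
#include <string.h><br>
#include <unistd.h><br>
<br>
// create a 16GiB file whose blocks are actually allocated (not a sparse hole).<br>
int main(void)<br>
{<br>
  int err, fd = open("bigfile", O_CREAT|O_WRONLY|O_TRUNC, 0644);<br>
<br>
  if (fd < 0) return perror("open"), 1;<br>
  // posix_fallocate() returns an errno value rather than setting errno.<br>
  if ((err = posix_fallocate(fd, 0, 16LL<<30)))<br>
    return fprintf(stderr, "fallocate: %s\n", strerror(err)), 1;<br>
  close(fd);<br>
<br>
  return 0;<br>
}<br>
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">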
> sadly 2019 is only 4 years ago, so there's a decent chunk of the 7 year rule<br>
> left to run out...<br>
<br>
I'm happy to change it, but I'd like to understand what's going on? We can<br>
switch to the kernel's exact size cap (assuming sysconf(_SC_PAGE_SIZE) is<br>
reliable), but _why_ is that magic number we had to get by reading the kernel<br>
source faster? We're handing this off to the kernel so it deals with the details<br>
and _avoids_ this sort of thing...<br>
<br>
(Why the kernel guys provided an API that can't handle O_LARGEFILE from 2001, I<br>
couldn't tell you...)<br>
<br>
Rob<br>
</blockquote></div></div>
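<div dir="ltr"><div><br></div><div>(and for concreteness, the "switch to the kernel's exact size cap" version seems like it would just be something along these lines. rough sketch of the idea only, not the actual toybox code, and it assumes a glibc new enough to expose copy_file_range():)</div><div><br></div><div>
#define _GNU_SOURCE  // for copy_file_range() (glibc >= 2.27)<br>
#include <fcntl.h><br>
#include <limits.h><br>
#include <stdio.h><br>
#include <sys/stat.h><br>
#include <unistd.h><br>
<br>
// copy argv[1] to argv[2] in chunks no bigger than INT_MAX & PAGE_MASK,<br>
// which is the kernel's own MAX_RW_COUNT clamp.<br>
int main(int argc, char *argv[])<br>
{<br>
  long cap = INT_MAX & ~(sysconf(_SC_PAGE_SIZE)-1);<br>
  struct stat st;<br>
  ssize_t got;<br>
  off_t left;<br>
  int in, out;<br>
<br>
  if (argc != 3) return 1;<br>
  if ((in = open(argv[1], O_RDONLY)) < 0 || fstat(in, &st)) return 1;<br>
  if ((out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644)) < 0) return 1;<br>
  for (left = st.st_size; left > 0; left -= got)<br>
    if ((got = copy_file_range(in, NULL, out, NULL,<br>
                               left > cap ? cap : left, 0)) <= 0)<br>
      return perror("copy_file_range"), 1;<br>
<br>
  return 0;<br>
}<br>
</div><div><br></div><div>(the NULL offsets let the descriptors' own file positions advance, which is what cp wants here. whether using the exact cap actually changes anything is presumably moot anyway, given the 25s vs 20s above turned out to be my VM getting migrated.)</div></div>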