<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 26, 2023 at 7:26 AM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 5/25/23 19:08, enh via Toybox wrote:<br>
> so i finally enabled copy_file_range for the _host_ toybox because someone<br>
> pointed out that we were copying 16GiB zip files around in the build, and even though<br>
> obviously we should stop doing that, 30s seemed unreasonable, and coreutils cp<br>
> "only" took 20s because of copy_file_range.<br>
<br>
Hardlinking them is not an option? :)</blockquote><div><br></div><div>yeah, that's the "obviously we should stop doing that" part. (we worry about pushback because the semantics aren't the same, though _personally_ i'd argue that the "it doesn't go stale if you do one part of the rebuild but not the whole thing" change is a _feature_ rather than a bug. but i don't yet know if anyone's _relying_ on the existing behavior.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> but toybox cp with copy_file_range still takes 25s. why?<br>
> <br>
> if (bytes<0 || bytes>(1<<30)) len = (1<<30);<br>
> <br>
> the checkin comment being:<br>
> <br>
> Update comments and add "sanity check" from kernel commit f16acc9d9b376.<br>
> (The kernel's been doing this since 2019, but older kernels may not, so...)<br>
<br>
The problem being that _before_ that commit, too big a sendfile didn't work<br>
right (returned an error from the kernel?). I suspect my range check was just<br>
the largest power of 2 that fit in the constraint...<br>
<br>
> what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is<br>
> actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB page<br>
> kernel?<br>
<br>
I don't think any of my test images have a PAGE_SHIFT other than 12? (Looks like<br>
Alpha, OpenRisc, and 64 bit Sparc are the only 3 architectures that CAN'T use a<br>
4k page size, and none of those are exactly load bearing these days.)<br></blockquote><div><br></div><div>(not relevant in this context, but darwin/arm64 is 16KiB. people do keep trying 64KiB linux/arm64, and one of these days they might succeed.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
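</blockquote><div><br></div><div>(for the curious, here's roughly what that cap works out to for different page sizes, assuming the kernel's MAX_RW_COUNT is still INT_MAX & PAGE_MASK, i.e. INT_MAX & ~(page_size-1). throwaway illustration, not anything from toybox:)</div><div><br></div><div>
#include <limits.h><br>
#include <stdio.h><br>
#include <unistd.h><br>
<br>
// print the kernel's per-call cap (INT_MAX & PAGE_MASK) for a few page<br>
// sizes, plus whatever page size the running kernel actually reports.<br>
int main(void)<br>
{<br>
  long i, sizes[] = {4096, 16384, 65536, sysconf(_SC_PAGE_SIZE)};<br>
<br>
  for (i = 0; i < 4; i++)<br>
    printf("page %6ld -> cap 0x%lx\n", sizes[i],<br>
           (unsigned long)(INT_MAX & ~(sizes[i]-1)));<br>
<br>
  return 0;<br>
}<br>
</div><div><br></div><div>(so it's a shade under 2GiB whatever the page size, 0x7ffff000 for 4KiB pages and 0x7fff0000 for 64KiB, which is roughly double the 1<<30 clamp either way.)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">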
But I wouldn't have expected it to be that much slower given the block size here<br>
is a megabyte, and the number of transactions being submitted... 16 gigs done a<br>
megabyte at a time is 16k system calls, which is:<br>
<br>
$ cat hello2.c<br>
#include <stdio.h><br>
<br>
int main(int argc, char *argv[])<br>
{<br>
int i;<br>
<br>
// 16384 one-character writes = 16384 write() syscalls<br>
for (i = 0; i<16384; i++) dprintf(1, " ");<br>
}<br>
$ gcc hello2.c<br>
$ strace ./a.out 2>&1 | grep write | wc -l<br>
16384<br>
$ time ./a.out | wc<br>
0 0 16384<br>
<br>
real 0m0.033s<br>
user 0m0.012s<br>
sys 0m0.043s<br>
<br>
Halving the number of output system calls would theoretically save you around<br>
0.015 seconds on a 10 year old laptop.<br>
<br>
So why does it have a ~20% impact on the kernel's throughput? The kernel's cap<br>
isn't even cleanly a power of 2. Maybe the kernel is using 2 megabyte huge pages<br>
internally in the disk cache, and the smaller size is causing unnecessary<br>
copying? Is 1<<29 slower or faster than 1<<30? I didn't think letting something<br>
else get in there and seek was a big deal on ssd? Maybe a different hardware<br>
burst transaction size?<br>
<br>
This isn't even "maybe zerocopy from a userspace buffer above a certain size<br>
keeps the userspace process suspended so read and write never get to overlap"<br>
territory: there's no userspace buffer. This is "give the kernel two filehandles<br>
and a length and let it sort it out". We tell it what to do in very abstract<br>
terms. In theory the ENTIRE COPY OPERATION could be deferred by the filesystem,<br>
scheduling it as a big journal entry to update extents. On something like btrfs<br>
it could be shared extents behind the scenes. What is going ON here?<br></blockquote><div><br></div><div>excellent questions that should have occurred to me.</div><div><br></div><div>i _think_ what happened is that my VM got migrated to a machine with different performance. i'm now reliably 25s for everyone. (previously my coreutils testing had been on one day and my toybox on the next.)</div><div><br></div><div>so, yeah, changing toybox here makes no noticeable difference.</div><div><br></div><div>(on a related note, is there a clever way to make a 16GiB file without dealing with dd? i used truncate and then cp to de-sparse it, but i was surprised there wasn't a truncate flag for a non-sparse file. i guess it's pretty niche though?)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
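</blockquote><div><br></div><div>(partially answering my own question: i think posix_fallocate() gets there in one call, since it allocates the blocks up front instead of leaving a hole, and glibc falls back to writing the blocks on filesystems that can't do it natively. minimal sketch, with a made-up filename and a 64-bit off_t assumed:)</div><div><br></div><div>
#include <fcntl.h><br>
#include <stdio.h><br>
#include <string.h><br>
#include <unistd.h><br>
<br>
// create a 16GiB file whose blocks are actually allocated (not a sparse hole).<br>
int main(void)<br>
{<br>
  int err, fd = open("bigfile", O_CREAT|O_WRONLY|O_TRUNC, 0644);<br>
<br>
  if (fd < 0) return perror("open"), 1;<br>
  // posix_fallocate() returns an errno value rather than setting errno.<br>
  if ((err = posix_fallocate(fd, 0, 16LL<<30)))<br>
    return fprintf(stderr, "fallocate: %s\n", strerror(err)), 1;<br>
  close(fd);<br>
<br>
  return 0;<br>
}<br>
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">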
> sadly 2019 is only 4 years ago, so there's a decent chunk of the 7 year rule<br>
> left to run out...<br>
<br>
I'm happy to change it, but I'd like to understand what's going on? We can<br>
switch to the kernel's exact size cap (assuming sysconf(_SC_PAGE_SIZE) is<br>
reliable), but _why_ is that magic number we had to get by reading the kernel<br>
source faster? We're handing this off to the kernel so it deals with the details<br>
and _avoids_ this sort of thing...<br>
<br>
(Why the kernel guys provided an API that can't handle O_LARGEFILE from 2001, I<br>
couldn't tell you...)<br>
<br>
Rob<br>
</blockquote></div></div>
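<div dir="ltr"><div><br></div><div>(and for concreteness, the "switch to the kernel's exact size cap" version seems like it would just be something along these lines. rough sketch of the idea only, not the actual toybox code, and it assumes a glibc new enough to expose copy_file_range():)</div><div><br></div><div>
#define _GNU_SOURCE  // for copy_file_range() (glibc >= 2.27)<br>
#include <fcntl.h><br>
#include <limits.h><br>
#include <stdio.h><br>
#include <sys/stat.h><br>
#include <unistd.h><br>
<br>
// copy argv[1] to argv[2] in chunks no bigger than INT_MAX & PAGE_MASK,<br>
// which is the kernel's own MAX_RW_COUNT clamp.<br>
int main(int argc, char *argv[])<br>
{<br>
  long cap = INT_MAX & ~(sysconf(_SC_PAGE_SIZE)-1);<br>
  struct stat st;<br>
  ssize_t got;<br>
  off_t left;<br>
  int in, out;<br>
<br>
  if (argc != 3) return 1;<br>
  if ((in = open(argv[1], O_RDONLY)) < 0 || fstat(in, &st)) return 1;<br>
  if ((out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644)) < 0) return 1;<br>
  for (left = st.st_size; left > 0; left -= got)<br>
    if ((got = copy_file_range(in, NULL, out, NULL,<br>
                               left > cap ? cap : left, 0)) <= 0)<br>
      return perror("copy_file_range"), 1;<br>
<br>
  return 0;<br>
}<br>
</div><div><br></div><div>(the NULL offsets let the descriptors' own file positions advance, which is what cp wants here. whether using the exact cap actually changes anything is presumably moot anyway, given the 25s vs 20s above turned out to be my VM getting migrated.)</div></div>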