[Toybox] copy_file_range and (1<<30)
enh
enh at google.com
Fri Jan 10 13:40:11 PST 2025
On Fri, May 26, 2023 at 10:26 AM Rob Landley <rob at landley.net> wrote:
>
> On 5/25/23 19:08, enh via Toybox wrote:
> > so i finally enabled copy_file_range for the _host_ toybox because someone
> > pointed out that we were copying 16GiB zip files around in the build, and even though
> > obviously we should stop doing that, 30s seemed unreasonable, and coreutils cp
> > "only" took 20s because of copy_file_range.
>
> Hardlinking them is not an option? :)
>
> > but toybox cp with copy_file_range still takes 25s. why?
> >
> > if (bytes<0 || bytes>(1<<30)) len = (1<<30);
> >
> > the checkin comment being:
> >
> > Update comments and add "sanity check" from kernel commit f16acc9d9b376.
> > (The kernel's been doing this since 2019, but older kernels may not, so...)
>
> The problem being that _before_ that commit, too big a sendfile didn't work
> right (returned an error from the kernel?). I suspect my range check was just
> the largest power of 2 that fit in the constraint...
is that true? the diff for that commit makes it look like the kernel just
silently used `min(MAX_RW_COUNT, len)` internally, which should be fine
with the usual "subtract what was actually written" logic?
(libc++ just started to use copy_file_range(), and i asked whether
they knew about this limit, and then i couldn't explain why toybox has a
special case...)
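
(for concreteness, here's roughly what i mean by "the usual loop". just a
sketch with made-up names, not the actual toybox code, and with the
EINTR/EXDEV fallback handling elided:)

#define _GNU_SOURCE
#include <unistd.h>

// no userspace cap: hand the kernel the remaining length each time and
// subtract whatever it reports back. if the kernel internally clamps a
// call to MAX_RW_COUNT, the next iteration just picks up the rest.
static int copy_fd_contents(int in_fd, int out_fd, off_t len)
{
  while (len > 0) {
    ssize_t done = copy_file_range(in_fd, NULL, out_fd, NULL, len, 0);

    if (done <= 0) return -1;
    len -= done;
  }

  return 0;
}
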
> > what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is
> > actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB page
> > kernel?
>
> I don't think any of my test images have a PAGE_SHIFT other than 12? (Looks like
> Alpha, OpenRisc, and 64 bit Sparc are the only 3 architectures that CAN'T use a
> 4k page size, and none of those are exactly load bearing these days.)
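
(fwiw the userspace arithmetic is trivial if anyone wants to check a given
machine; a sketch, assuming the kernel's definition really is
INT_MAX & PAGE_MASK:)

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

// print the page size and INT_MAX rounded down to a whole number of
// pages, which is what i'm assuming MAX_RW_COUNT works out to:
// 0x7ffff000 for the usual 4KiB pages, 0x7fff0000 with 64KiB pages.
int main(void)
{
  long page = sysconf(_SC_PAGE_SIZE);
  long cap = INT_MAX & ~(page-1);

  printf("page %ld cap %#lx (%ld)\n", page, cap, cap);

  return 0;
}
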
>
> But I wouldn't have expected it to be that much slower given the block size here
> is a megabyte, and the number of transactions being submitted... 16 gigs done a
> megabyte at a time is 16k system calls, which is:
>
> $ cat hello2.c
> #include <stdio.h>
>
> int main(int argc, char *argv[])
> {
>   int i;
>
>   for (i = 0; i<16384; i++) dprintf(1, " ");
> }
> $ gcc hello2.c
> $ strace ./a.out 2>&1 | grep write | wc -l
> 16384
> $ time ./a.out | wc
> 0 0 16384
>
> real 0m0.033s
> user 0m0.012s
> sys 0m0.043s
>
> Halving the number of output system calls would theoretically save you around
> 0.015 seconds on a 10 year old laptop.
>
> So why does it have a ~20% impact on the kernel's throughput? The kernel's cap
> isn't even cleanly a power of 2. Maybe the kernel is using 2 megabyte huge pages
> internally in the disk cache, and the smaller size is causing unnecessary
> copying? Is 1<<29 slower or faster than 1<<30? I didn't think letting something
> else get in there and seek was a big deal on ssd? Maybe a different hardware
> burst transaction size?
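
(if we want a real answer to the 1<<29 vs 1<<30 question, a throwaway
harness like this, run under time(1) against one of those 16GiB zips with
different caps, would settle it. hypothetical and untested, with most
error handling skipped:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

// copy argv[1] to argv[2] via copy_file_range(), capping each call at
// argv[3] bytes, e.g. "time ./a.out big.zip copy.zip $((1<<29))" versus
// "$((1<<30))". throwaway experiment, not toybox code.
int main(int argc, char *argv[])
{
  if (argc != 4) return 1;

  int in = open(argv[1], O_RDONLY);
  int out = open(argv[2], O_WRONLY|O_CREAT|O_TRUNC, 0644);
  size_t cap = strtoul(argv[3], 0, 0);
  struct stat st;

  if (in<0 || out<0 || !cap || fstat(in, &st)) return 1;

  for (off_t left = st.st_size; left > 0;) {
    ssize_t done = copy_file_range(in, NULL, out, NULL,
                                   left>cap ? cap : left, 0);

    if (done <= 0) return 1;
    left -= done;
  }

  return 0;
}
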
>
> This isn't even "maybe zerocopy from a userspace buffer above a certain size
> keeps the userspace process suspended so read and write never get to overlap"
> territory: there's no userspace buffer. This is "give the kernel two filehandles
> and a length and let it sort it out". We tell it what to do in very abstract
> terms. In theory the ENTIRE COPY OPERATION could be deferred by the filesystem,
> scheduling it as a big journal entry to update extents. On something like btrfs
> it could be shared extents behind the scenes. What is going ON here?
>
> > sadly 2019 is only 4 years ago, so there's a decent chunk of the 7 year rule
> > left to run out...
>
> I'm happy to change it, but I'd like to understand what's going on? We can
> switch to the kernel's exact size cap (assuming sysconf(_SC_PAGE_SIZE) is
> reliable), but _why_ is that magic number we had to get by reading the kernel
> source faster? We're handing this off to the kernel so it deals with the details
> and _avoids_ this sort of thing...
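
(concretely, the kind of change i had in mind, just a sketch of the clamp
using the kernel's cap instead of the hardcoded 1<<30, not a tested patch:)

#include <limits.h>
#include <unistd.h>

// hypothetical replacement for the 1<<30 check: cap each request at
// INT_MAX rounded down to a whole page, matching what i think the
// kernel's MAX_RW_COUNT is. bytes<0 still means "copy until EOF".
static size_t clamp_len(long long bytes)
{
  size_t cap = INT_MAX & ~(sysconf(_SC_PAGE_SIZE)-1);

  return (bytes<0 || bytes>cap) ? cap : bytes;
}
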
>
> (Why the kernel guys provided an API that can't handle O_LARGEFILE from 2001, I
> couldn't tell you...)
>
> Rob