[Toybox] copy_file_range and (1<<30)

enh enh at google.com
Fri May 26 13:43:09 PDT 2023


On Fri, May 26, 2023 at 7:26 AM Rob Landley <rob at landley.net> wrote:

> On 5/25/23 19:08, enh via Toybox wrote:
> > so i finally enabled copy_file_range for the _host_ toybox because someone
> > pointed out that we were copying 16GiB zip files around in the build, and
> > even though obviously we should stop doing that, 30s seemed unreasonable,
> > and coreutils cp "only" took 20s because of copy_file_range.
>
> Hardlinking them is not an option? :)
>

yeah, that's the "obviously we should stop doing that" part. (we worry
about pushback because the semantics aren't the same, though _personally_
i'd argue that the "it doesn't go stale if you do one part of the rebuild
but not the whole thing" change is a _feature_ rather than a bug. but i
don't yet know if anyone's _relying_ on the existing behavior.)


> > but toybox cp with copy_file_range still takes 25s. why?
> >
> >       if (bytes<0 || bytes>(1<<30)) len = (1<<30);
> >
> > the checkin comment being:
> >
> > Update comments and add "sanity check" from kernel commit f16acc9d9b376.
> > (The kernel's been doing this since 2019, but older kernels may not, so...)
>
> The problem being that _before_ that commit, too big a sendfile didn't work
> right (returned an error from the kernel?). I suspect my range check was just
> the largest power of 2 that fit in the constraint...
>
> > what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is
> > actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a
> > non-4KiB page kernel?
>
> I don't think any of my test images have a PAGE_SHIFT other than 12?
> (Looks like Alpha, OpenRisc, and 64 bit Sparc are the only 3 architectures
> that CAN'T use a 4k page size, and none of those are exactly load bearing
> these days.)
>

(not relevant in this context, but darwin/arm64 uses 16KiB pages. people do
keep trying 64KiB linux/arm64, and one of these days they might succeed.)


> But I wouldn't have expected it to be that much slower given the block size
> here is a megabyte, and the number of transactions being submitted... 16 gigs
> done a megabyte at a time is 16k system calls, which is:
>
> $ cat hello2.c
> #include <stdio.h>
>
> int main(int argc, char *argv[])
> {
>   int i;
>
>   for (i = 0; i<16384; i++) dprintf(1, " ");
> }
> $ gcc hello2.c
> $ strace ./a.out 2>&1 | grep write | wc -l
> 16384
> $ time ./a.out | wc
>       0       0   16384
>
> real    0m0.033s
> user    0m0.012s
> sys     0m0.043s
>
> Halving the number of output system calls would theoretically save you around
> 0.015 seconds on a 10 year old laptop.
>
> So why does it have a ~20% impact on the kernel's throughput? The kernel's
> cap isn't even cleanly a power of 2. Maybe the kernel is using 2 megabyte
> huge pages internally in the disk cache, and the smaller size is causing
> unnecessary copying? Is 1<<29 slower or faster than 1<<30? I didn't think
> letting something else get in there and seek was a big deal on ssd? Maybe
> a different hardware burst transaction size?
>
> This isn't even "maybe zerocopy from a userspace buffer above a certain size
> keeps the userspace process suspended so read and write never get to overlap"
> territory: there's no userspace buffer. This is "give the kernel two
> filehandles and a length and let it sort it out". We tell it what to do in
> very abstract terms. In theory the ENTIRE COPY OPERATION could be deferred
> by the filesystem, scheduling it as a big journal entry to update extents.
> On something like btrfs it could be shared extents behind the scenes. What
> is going ON here?
>

excellent questions that should have occurred to me.

i _think_ what happened is that my VM got migrated to a machine with
different performance. i'm now reliably seeing 25s for everyone. (previously
my coreutils testing had been on one day and my toybox testing on the next.)

so, yeah, changing toybox here makes no noticeable difference.
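
(for anyone skimming the thread, the loop in question is roughly this shape.
a minimal sketch of a copy_file_range(2) loop with the clamp, not the actual
toybox code:)

#define _GNU_SOURCE
#include <unistd.h>

// copy "bytes" bytes from in to out; bytes<0 means "unknown, copy to EOF".
static int copy_fd(int in, int out, long long bytes)
{
  while (bytes) {
    // clamp each request: kernels since 2019 clamp internally to
    // MAX_RW_COUNT, but older ones may not, hence the 1<<30 ceiling.
    size_t len = (bytes<0 || bytes>(1<<30)) ? (1<<30) : bytes;
    ssize_t did = copy_file_range(in, 0, out, 0, len, 0);

    if (did < 0) return -1;       // error, errno has details
    if (!did) break;              // EOF
    if (bytes > 0) bytes -= did;
  }

  return 0;
}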

(on a related note, is there a clever way to make a 16GiB file without
dealing with dd? i used truncate and then cp to de-sparse it, but i was
surprised there wasn't a truncate flag for a non-sparse file. i guess it's
pretty niche though?)
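
(one answer i haven't benchmarked: posix_fallocate(3) allocates the blocks up
front without writing them, so the file isn't sparse and du reports the full
size, though the data still reads back as zeroes. a quick sketch, assuming
the filesystem supports fallocate; util-linux also ships a fallocate(1)
wrapper, "fallocate -l 16G file", where available.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

// preallocate a 16GiB file without dd. build with -D_FILE_OFFSET_BITS=64
// on 32-bit hosts so the off_t length below doesn't truncate.
int main(int argc, char *argv[])
{
  int fd = open(argc>1 ? argv[1] : "big.bin", O_WRONLY|O_CREAT, 0644);
  int err = fd<0 ? -1 : posix_fallocate(fd, 0, 16LL<<30);

  if (fd<0 || err) {
    fprintf(stderr, "failed: %s\n", fd<0 ? "open" : strerror(err));
    return 1;
  }

  return 0;
}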


> > sadly 2019 is only 4 years ago, so there's a decent chunk of the 7 year
> > rule left to run out...
>
> I'm happy to change it, but I'd like to understand what's going on? We can
> switch to the kernel's exact size cap (assuming sysconf(_SC_PAGE_SIZE) is
> reliable), but _why_ is that magic number we had to get by reading the kernel
> source faster? We're handing this off to the kernel so it deals with the
> details and _avoids_ this sort of thing...
>
> (Why the kernel guys provided an API that can't handle O_LARGEFILE from
> 2001, I couldn't tell you...)
>
> Rob
>
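
(fwiw, if we did want to mirror the kernel's exact cap rather than 1<<30,
reproducing MAX_RW_COUNT from userspace looks like the sketch below, assuming
sysconf(_SC_PAGE_SIZE) really does match the kernel's PAGE_SIZE:)

#include <limits.h>
#include <unistd.h>

// the kernel's MAX_RW_COUNT is (INT_MAX & PAGE_MASK): 0x7ffff000 with 4KiB
// pages, 0x7fff0000 with 64KiB pages.
static size_t max_rw_count(void)
{
  size_t page = sysconf(_SC_PAGE_SIZE);

  return INT_MAX & ~(page - 1);
}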