[Toybox] copy_file_range and (1<<30)

enh enh at google.com
Wed Jan 15 08:02:09 PST 2025


On Fri, Jan 10, 2025 at 6:09 PM Rob Landley <rob at landley.net> wrote:
>
> On 1/10/25 15:40, enh wrote:
> > On Fri, May 26, 2023 at 10:26 AM Rob Landley <rob at landley.net> wrote:
> >>
> >> On 5/25/23 19:08, enh via Toybox wrote:
> >>> so i finally enabled copy_file_range for the _host_ toybox because someone
> >>> pointed out that we were copying 16GiB zip files around in the build, and even
> >>> though obviously we should stop doing that, 30s seemed unreasonable, and
> >>> coreutils cp "only" took 20s because of copy_file_range.
> >>
> >> Hardlinking them is not an option? :)
> >>
> >>> but toybox cp with copy_file_range still takes 25s. why?
> >>>
> >>>        if (bytes<0 || bytes>(1<<30)) len = (1<<30);
> >>>
> >>> the checkin comment being:
> >>>
> >>> Update comments and add "sanity check" from kernel commit f16acc9d9b376.
> >>> (The kernel's been doing this since 2019, but older kernels may not, so...)
> >>
> >> The problem being that _before_ that commit, too big a sendfile didn't work
> >> right (returned an error from the kernel?). I suspect my range check was just
> >> the largest power of 2 that fit in the constraint...
> >
> > is that true? the diff for that commit makes it look like it
> > internally silently used `min(MAX_RW_COUNT, len)` which should be fine
> > with the usual "subtract what was actually written" logic?
> >
> > (libc++ just started to use copy_file_range(), and i asked whether
> > they knew about this limit, and then couldn't explain why toybox has a
> > special case...)
>
> Let's see... grep for '1<<30' in lib, git annotate the file, search for
> the 1<<30... cut and paste the hash and do a ^1 to peel off that commit
> (sigh, I want a GUI/IDE tool for this where I could just click)... it
> was introduced in toybox commit 9b368059deec which says "Update comments
> and add "sanity check" from kernel commit f16acc9d9b376. (The kernel's
> been doing this since 2019, but older kernels may not, so...)"
>
> The check from that kernel commit was:
>
> +       return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> +                               len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
>
> And checking out that commit and searching for MAX_RW_COUNT :
>
> include/linux/fs.h:#define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
>
> And of course:
>
> include/asm-generic/page.h:#define PAGE_MASK    (~(PAGE_SIZE-1))
>
> So they're using 2 gigs minus 4k, so yes 1<<30 (1 gig) is the largest
> power of 2 that fits within that constraint.
>
> Is there a big performance difference between queueing up 1 gigabyte at
> a time and 2 gigabytes at a time?

that's not really my concern here, so much as "was this an actual bug,
and which range of kernel versions did it affect?", so that i can pass
the advice on to other projects using copy_file_range() [and add it to
the man page], or even just add the clamping in libc itself so no-one
has to think about this.
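
(fwiw, the arithmetic is easy to sanity-check with a few lines of C -- the
function names here are mine, just for illustration, and the kernel's
PAGE_MASK math is reproduced from the headers quoted below:)

```c
#include <limits.h>

/* what the kernel calls MAX_RW_COUNT: INT_MAX & PAGE_MASK,
   where PAGE_MASK is ~(PAGE_SIZE-1) */
static long max_rw_count(long page_size)
{
  return INT_MAX & ~(page_size - 1);
}

/* largest power of two that still fits under that limit
   (limit/2 comparison avoids signed overflow on 32-bit long) */
static long largest_pow2_chunk(long page_size)
{
  long limit = max_rw_count(page_size), chunk = 1;

  while (chunk <= limit / 2) chunk <<= 1;
  return chunk;
}
```

with 4k pages that gives 2147479552 ("2 gigs minus 4k") and a largest
power of two of 1<<30; even with 64k pages the answer is still 1<<30, so
the existing clamp survives bigger page sizes.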

> We can be exact, I just didn't want it
> to start going boing again on systems where PAGE_SIZE isn't 4k or some
> such. For a 16 gigabyte file it returns back into our code 16 times (vs
> 8 times) which either way you'd THINK would take a grand total of like a
> millisecond even with switching overhead to deal with. A 5 second
> difference seems... odd? (If you've got zero readahead and zero
> writeback caching and you're on rotating media where a slight hitch
> costs a rotational delay then maybe? Even so, it should just be 16 extra
> rotations vs 8 extra rotations. A dog slow ancient hard drive was 1200
> rpm meaning 20 rps meaning 8 extra was less than half a second...)
>
> I mean, IDEALLY the kernel would have a "repeat until spanked" -1 value
> where it just copies until EOF and stops bothering us, but the kernel
> didn't seem to offer that as an option last I checked...
>
> >>> what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is
> >>> actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB page
> >>> kernel?
>
> Yeah, that. ^^^
>
> The constraint is there because older kernels would error out on too big
> a value, rather than clamping, and I didn't want to query the kernel
> version and have two codepaths.

that's what i was trying to find, and i still haven't found it. `git log
-p read_write.c` makes it look like the first commit that started calling
do_splice_direct() had the clamping.

commit eac70053a141998c40907747d6cea1d53a9414be
Author: Anna Schumaker <Anna.Schumaker at netapp.com>
Date:   Tue Nov 10 16:53:33 2015 -0500

    vfs: Add vfs_copy_file_range() support for pagecache copies

    This allows us to have an in-kernel copy mechanism that avoids frequent
    switches between kernel and user space.  This is especially useful so
    NFSD can support server-side copies.

    The default (flags=0) means to first attempt copy acceleration, but use
    the pagecache if that fails.

    Signed-off-by: Anna Schumaker <Anna.Schumaker at Netapp.com>
    Reviewed-by: Darrick J. Wong <darrick.wong at oracle.com>
    Reviewed-by: Padraig Brady <P at draigBrady.com>
    Reviewed-by: Christoph Hellwig <hch at lst.de>
    Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>

but in particular, i don't think the kernel commit you quote --
f16acc9d9b376 -- _adds_ a sanity check; it just refactors it into a
new function?
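
(in case it helps anyone reading along, the caller-side pattern we're
talking about -- clamp each request and advance by whatever the kernel
actually copied -- looks roughly like this. `copy_all` is a made-up name,
it goes through syscall() directly only so the sketch doesn't depend on
which libc declares copy_file_range(), and it assumes a kernel new enough
to have the syscall at all, i.e. 4.5+:)

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sketch: copy len bytes between fds with copy_file_range(), clamping
   each request to 1<<30 so kernels that errored out on oversized
   lengths (instead of clamping to MAX_RW_COUNT) behave the same as
   newer ones.  Returns bytes copied, or -1 with errno set. */
static long copy_all(int infd, int outfd, long len)
{
  long done = 0;

  while (done < len) {
    long chunk = len - done;

    if (chunk > (1L << 30)) chunk = 1L << 30;
    /* NULL offsets: use (and advance) each fd's own file position */
    long n = syscall(SYS_copy_file_range, infd, (void *)0, outfd,
                     (void *)0, chunk, 0);

    if (n < 0) {
      if (errno == EINTR) continue;
      return -1;
    }
    if (!n) break;        /* EOF: source shorter than expected */
    done += n;            /* advance by what was actually written */
  }
  return done;
}
```

the "subtract what was actually written" loop means a clamping kernel and
an exact-length kernel both converge to the same result; the clamp only
exists so the pre-clamping kernels never see a length they'd reject.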

> Rob

