[Toybox] copy_file_range and (1<<30)

Rob Landley rob at landley.net
Sat May 27 07:50:09 PDT 2023


On 5/26/23 15:43, enh wrote:
>     > what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is
>     > actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB page
>     > kernel?
> 
>     I don't think any of my test images have a PAGE_SHIFT other than 12? (Looks like
>     Alpha, OpenRisc, and 64 bit Sparc are the only 3 architectures that CAN'T use a
>     4k page size, and none of those are exactly load bearing these days.)
> 
> (not relevant in this context, but darwin/arm64 is 16KiB. people do keep trying
> 64KiB linux/arm64, and one of these days they might succeed.)
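
(For what it's worth, that clamp mostly just means the caller has to loop over
short returns anyway. A minimal sketch of the shape that loop takes, with an
invented helper name, and not a claim about what toybox's actual code does:)

  // Copy len bytes between two fds with copy_file_range(2). The kernel may
  // truncate each request to MAX_RW_COUNT (INT_MAX & PAGE_MASK), so keep
  // looping until the count is used up or something goes wrong.
  #define _GNU_SOURCE
  #include <unistd.h>

  static int copy_range(int infd, int outfd, off_t len)
  {
    while (len > 0) {
      ssize_t got = copy_file_range(infd, NULL, outfd, NULL, len, 0);

      if (got <= 0) return -1;  // 0 means early EOF, -1 means check errno
      len -= got;
    }

    return 0;
  }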

I remember the litany of ouch from back when Alpha had forced 8k page size and
was thus weird. Among other things, you could only mount certain members of the
ext2 filesystem family on it. It's one of those "they've had all the time in the
world since to fix this" meets "there is zero regression testing so this will
bit-rot tremendously"... I also remember the FUN corner case with QEMU
application emulation where host and target page sizes differed and mmap()
system call translation had to figure out what to do with the leftover bit at
the end of the last page. And I watched YEARS of Mel Gorman trying to make
transparent hugepages work...

*shrug* Not necessarily relevant to modern times, it's entirely possible it all
got fixed and is reliable now and I didn't get the memo. But I may have developed
a tendency to just make 4k page size work and then wait for somebody to complain. :)

>     Halving the number of output system calls would theoretically save you around
>     0.015 seconds on a 10 year old laptop.
> 
>     So why does it have a ~20% impact on the kernel's throughput? The kernel's cap
>     isn't even cleanly a power of 2. Maybe the kernel is using 2 megabyte huge pages
>     internally in the disk cache, and the smaller size is causing unnecessary
>     copying? Is 1<<29 slower or faster than 1<<30? I didn't think letting something
>     else get in there and seek was a big deal on ssd? Maybe a different hardware
>     burst transaction size?
> 
>     This isn't even "maybe zerocopy from a userspace buffer above a certain size
>     keeps the userspace process suspended so read and write never get to overlap"
>     territory: there's no userspace buffer. This is "give the kernel two filehandles
>     and a length and let it sort it out". We tell it what to do in very abstract
>     terms. In theory the ENTIRE COPY OPERATION could be deferred by the filesystem,
>     scheduling it as a big journal entry to update extents. On something like btrfs
>     it could be shared extents behind the scenes. What is going ON here?
> 
> excellent questions that should have occurred to me.

I break everything, and keep having to clean up after myself.

> i _think_ what happened is that my VM got migrated to a machine with different
> performance. i'm now reliably 25s for everyone. (previously my coreutils testing
> had been on one day and my toybox on the next.)
> 
> so, yeah, changing toybox here makes no noticeable difference.
> 
> (on a related note, is there a clever way to make a 16GiB file without dealing
> with dd? i used truncate and then cp to de-sparse it, but i was surprised there
> wasn't a truncate flag for a non-sparse file. i guess it's pretty niche though?)

truncate(1) is a wrapper for truncate(2); the call you want is posix_fallocate(),
which hasn't got a command line wrapper I'm aware of... Oh
look, they added one to util-linux. With 8 gazillion command line options for
the linux-specific fallocate(2) syscall, because of course they did. (Collapse
range is only supported on ext4, you say? What a good thing to expose to
userspace...)
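
For the original question, the C version of "give me a 16GiB non-sparse file" is
short enough to inline here. A rough sketch, with a made-up filename:

  // Create a 16GiB file with the blocks actually allocated (not sparse).
  // posix_fallocate() quietly falls back to doing it the slow way (per-block
  // writes in libc) if the filesystem can't preallocate.
  #define _FILE_OFFSET_BITS 64
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
    // "walrus.img" is just an example name.
    int fd = open("walrus.img", O_WRONLY|O_CREAT, 0644), err;

    if (fd < 0) {
      perror("open");
      return 1;
    }
    // posix_fallocate() returns an errno value, it doesn't set errno.
    if ((err = posix_fallocate(fd, 0, 16LL<<30))) {
      fprintf(stderr, "fallocate: %s\n", strerror(err));
      return 1;
    }

    return 0;
  }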

Oh goddess, the -x flag. If the underlying filesystem doesn't support the
syscall to do it the fast way, FAIL BY DEFAULT unless you provide a flag to fall
back to do it the slow way. Imagine if cp worked that way and you needed to pass
-x when sendfile() wasn't supported. Having an -X to fail if you can't do the fast
path makes sense, but failing by default is... ow.
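
(The default behavior I'd expect looks more like the sketch below, with
sendfile() standing in for the fast path and an invented helper name, not any
particular cp's actual code:)

  // Try the fast path; if this fd pair or filesystem can't do it, quietly
  // fall back to the boring read/write loop instead of erroring out.
  #include <sys/sendfile.h>
  #include <unistd.h>
  #include <errno.h>

  static int copy_fd(int infd, int outfd)
  {
    char buf[65536];
    ssize_t got;

    while ((got = sendfile(outfd, infd, NULL, sizeof(buf))) > 0);
    if (!got) return 0;
    if (errno != EINVAL && errno != ENOSYS) return -1;

    // Slow path: always works, just does more copies.
    while ((got = read(infd, buf, sizeof(buf))) > 0)
      if (write(outfd, buf, got) != got) return -1;

    return got < 0 ? -1 : 0;
  }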

Why is -l an option?

  $ fallocate walrus
  fallocate: no length argument specified

I mean seriously, WHY IS THIS AN OPTION? Why is it not "always argument #1" and
then you can go offset:len if you want to start later in the file instead of
having a separate -o? What is WRONG with...

  $ fallocate one two
  fallocate: unexpected number of arguments

It doesn't even take "FILE..." but instead works on EXACTLY ONE...

       -n, --keep-size
              Do not modify the apparent length of the file. This may
              effectively allocate blocks past EOF, which can be removed
              with a truncate.

Oh look, a new way to damage filesystems I hadn't even thought of. (This 37 byte
README file is eating 2 gigs of disk space. How droll...)

Ahem. Yes, there's a way to do it. Yes I can add it. I may need a bit of a walk
first. And possibly a muffin.

Rob

