[Toybox] cp --sparse=

Wed Feb 18 10:15:56 PST 2026

On Tue, Feb 17, 2026 at 5:17 PM Rob Landley <rob at landley.net> wrote:
>
> On 2/13/26 15:10, enh wrote:
> > yeah, given that the "is there at least one hole?" test is just a
> > single lseek(), it's odd to me that it's not the coreutils default. i
> > feel like "preserve holes" and "make holes wherever possible" are two
> > unrelated behaviors.
>
> https://github.com/mpe/linux-fullhistory/commit/982d816581ee is only
> from 2011, and I'm assuming "hurd" doesn't support it anyway.
>
> Doing an lseek(SEEK_HOLE) on each source file for auto mode works for
> me. Shouldn't be that expensive, and if it is, it's sort of a prefetch.
> (Not that I've benched it, but you're not going to find a faster way to get
> this data so the alternative is DON'T get this data. Which would be
> "defer asking the question until you've read an aligned page of zeroes",
> which means not doing sendfile()...)
>
> And if you want to get nasty about it you can use the return value to
> posix_fallocate() the destination chunk you're about to write into,
> which is uncomfortable for a bunch of reasons. (If you hit ctrl-c at
> exactly the wrong time you can wind up with a 0 byte file taking up a
> couple gigabytes of disk space. As fallout from an optimization meant to
> save storage. No I am not adding a signal handler to call truncate(),
> let's just not go there.)
>
> Grumble grumble simple implementation...
>
> >> Eh, maybe. The point of xsendfile() is to wrap the kernel's
> >> copy_file_range() stuff for speed, and you'd THINK that would
> >> automatically do sparseness if the original was sparse but apparently
> >> not. (And that's before "what does macos do"...)
> >
> > /me checks man page
> >
> >         If fd_in is a sparse file, then copy_file_range() may expand any
> >         holes existing in the requested range.  Users may benefit from
> >         calling copy_file_range() in a loop, and using the lseek(2)
> >         SEEK_DATA and SEEK_HOLE operations to find the locations of data
> >         segments.
> >
> > d'oh!
>
> You think that's bad, look at man 2 splice. (Why do you CARE that one
> end is a pipe? Just DO it, will you? I have wanted "connect these two
> filehandles together and let the process exit so the pipeline continues
> PAST it" for DECADES (it would make netcat so much easier, it would mean
> tar didn't have to keep a second process around shoveling data once it
> had autodetected the type of a piped file...) and every time I brought
> it up on the kernel list they went "oh no, that's crazy". They
> eventually implemented mount --move and eventually gave us punch_hole()
> and the ability to find holes, but execve(NULL, argv, envp) to re-exec
> the current running process without requiring /proc/self/exe to be
> accessible? That's crazy talk. Let's once again try to remove vfork()
> because we don't understand what it's for...)
>
> Grumble grumble. Gotta look at netbsd...
>
> >> Would there be any other users of a sparse copy function, given that tar
> >> isn't doing fd->fd copying but always has an archive format at one end?
> >
> > (given that you've reused the cp code for stuff like install, you're
> > probably right that this is the only likely user.)
>
> There's a balance between bundling multiple commands into one command.c
> and leaking implementation details into lib/ and so far both answers are
> wrong. :(
>
> (There are too many sh builtins in the same file, ps.c desperately needs
> breaking up... but lib hasn't got access to TT...)
>
> >>> and i _also_ haven't thought hard
> >>> enough about why tar.c's sparse file handling is a two-pass algorithm,
> >>> and whether that's meaningful for cp too.
> >>
> >> Not sure what you mean: sendfile_sparse() is used by the extraction
> >> code, it loops over an input array and does the thing, I think in one go?
> >
> > it was the "there's a TT.sparse array passed in" part that i meant.
>
> When extracting, the sparseness is remembered from tar metadata so the
> packed contents can be distributed accordingly. When creating, the
> sparseness needs to be recorded in the tar metadata before we go back
> and pack up the scattered data.
>
> The two passes are because the sparse info is saved in the headers
> before the data, not inline with the data. They occur physically
> separated in the file.
>
> >> Would you like cp to auto-sparse files, or only with -a, or...?
> >
> > i hadn't really thought about the auto-sparsing side. but aiui you're
> > saying coreutils _does_ try to do that,
>
> No, I was assuming you were more familiar with this command than I was
> and could just tell me what you need. "What gnu does" starts with
> reading the man page more closely, which I have now done, and "auto" is
> the default, both "never" and "always" must be explicitly requested, and
> things like -a don't affect this.
>
> Which is very gnu.
>
> > it's the preservation that it
> > doesn't do? that seems like it's doing the _more_ expensive thing? (or
> > they have some weird assumption that the cpu time for scanning each
> > block is cheap but one failed lseek() is expensive?)
>
> The gnu/dammit project was tantrumed into existence in 1983, linux hole
> detection support went in ~30 years later. They never went back and
> cleaned things up because learning better is against the gnu/philosophy.

(true, plus the "hurd" point you make elsewhere.)
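
For reference, a minimal sketch of that single-lseek() test, assuming
Linux-style SEEK_HOLE semantics (a filesystem without hole support just
reports the one implicit hole at end of file). has_hole() is a made-up
name, not anything in toybox or coreutils, and the fallback SEEK_HOLE
value is the Linux one:

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Linux value, in case the libc headers hide it without _GNU_SOURCE.
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

// Hypothetical helper: returns 1 if the regular file behind fd has a hole
// before EOF, 0 if it has none, -1 if we can't tell (not a regular file,
// or lseek() failed). The file offset is put back where it was.
int has_hole(int fd)
{
  struct stat st;
  off_t old, hole;

  if (fstat(fd, &st) || !S_ISREG(st.st_mode)) return -1;
  if (!st.st_size) return 0;
  if ((old = lseek(fd, 0, SEEK_CUR)) == -1) return -1;

  // Every file has an implicit hole at EOF, so "no holes" means the
  // first hole found is at (or past) the end of the file.
  hole = lseek(fd, 0, SEEK_HOLE);
  if (lseek(fd, old, SEEK_SET) == -1 || hole == -1) return -1;

  return hole < st.st_size;
}
```

A -1 here would presumably mean "treat the source as not sparse" rather
than "fail the copy".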

> > (fwiw, it was the "lack of hole preservation by default" that caught
> > people out. i don't think they had any expectations about adding holes
> > that didn't already exist.)
>
> The sane thing would be cp's default is --sparse=never and then cp -a or
> cp -p added --sparse=auto but of course that's not what gnu did. Sigh...

(indeed.)

> Both -s and -S are used, no logical short opt for --sparse. Grumble...
>
> >> Since this is already using sendfile, the likely thing to do is make a
> >> sparse_sendfile() that calls sendfile_len() so we get the in-kernel copy
> >> of the segments. Except the OTHER fun thing here is we may be copying
> >> OVER existing files (I don't remember if tar always just deletes and
> >> recreates, or I just decided not to care there). But "cp file.img
> >> /dev/fda" was definitely a thing back in the day.
> >>
> >> So the NEXT question is what do you do if the old and new files don't
> >> agree about sparseness? You can:
> >>
> >> A) seek past anyway and leave old data in place if there was any,
> >> B) fallocate(PUNCH_HOLE) the holes, which apparently can fail
> >> C) write zeroes to blank any not previously sparse dest data, but the
> >> result isn't sparse in those places
> >> D) delete dest file if source is sparse and recreate it with lseek
> >> E) think harder about what to do
> >
> > huh ... i also hadn't thought of that. are there existing cases where
> > toybox (or coreutils) look at the _destination_ file rather than just
> > assuming the source is canonical?
>
> I hadn't tested what coreutils does, I was just asking what _should_
> happen when overwriting an existing destination file. ("What gnu does"
> is quite often insane, and if we don't NEED to be slavishly crazy I'd
> rather not be influenced by their design choices before hearing what the
> user actually wants.)
>
> Probably the correct behavior is to truncate any existing destination
> file immediately (since we're going to overwrite all of it, we just make
> an effort to retain the same inode and fuck up hardlinks). If the
> truncate fails fall back to the "copy zeroes" behavior, because we're
> writing to a block device or a pipe or something* that _can't_ represent
> sparseness in the result. (The --sparse=auto vs --sparse=always
> difference is about what properties of the SOURCE file indicate holes.
> The destination either has holes or it doesn't, there's no "some real
> runs of zeroes and some holes" option I've noticed, and I dunno why
> you'd want that.)
>
> * Speaking of pipes, did you know that "man 2 posix_fadvise()" says
> EINVAL comes from a bad "advice" value, but in practice the kernel
> _also_ returns that if you call it on a filehandle from /dev/urandom?
> Yes I poked Alejandro Colomar. He asked me to send a patch. I mention
> this because the man page documents ESPIPE as another thing to expect,
> but of course that's not what the kernel returns.
>
> > (even if you're ignoring that, your case B is still relevant: "is it a
> > failure to copy if you copy a sparse file but can't make the copy
> > sparse?".)
>
> No, but possibly you'd emit a warning? I remember back in the day "cp
> thingy /mnt/fatpartition" used to complain about losing metadata. (Or
> was that tar?)
>
> I kinda lean against the warning though: it's not REALLY an error, and
> there's already a warning about running out of space. (The main reason
> NOT to automatically turn runs of zeroes into sparse files is so you
> don't get disk full errors updating the middle of a file. That's why
> losetup cared back in the day. No idea what qemu does for -hda when that
> happens, probably a panic exit...)
>
> How about if you _explicitly_ ask for sparse preservation (by saying
> --sparse=auto or --sparse=always) and it can't, then warn. But if it's
> just the default, best effort and fall back silently.

(yeah, that makes sense.)
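
To make the truncate-then-fall-back idea concrete, here is a rough
sketch, not a proposed patch. copy_sparse(), write_all() and
fill_zeroes() are made-up names, the SEEK_DATA/SEEK_HOLE fallback values
are the Linux ones, and it uses a plain pread()/write() loop where the
real thing would want the in-kernel copy. It walks the source with
SEEK_DATA/SEEK_HOLE, truncates the destination first, and writes real
zeroes when the truncate fails (pipe, block device):

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Linux values, in case the libc headers hide them without _GNU_SOURCE.
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

// Hypothetical helper: write the whole buffer, retrying short writes.
static int write_all(int fd, const char *buf, size_t len)
{
  while (len) {
    ssize_t n = write(fd, buf, len);

    if (n <= 0) return -1;
    buf += n;
    len -= n;
  }
  return 0;
}

// Hypothetical helper: write len bytes of real zeroes.
static int fill_zeroes(int fd, off_t len)
{
  char zero[4096] = {0};

  while (len > 0) {
    size_t n = len < (off_t)sizeof(zero) ? (size_t)len : sizeof(zero);

    if (write_all(fd, zero, n)) return -1;
    len -= n;
  }
  return 0;
}

// Copy regular file in_fd to out_fd. Returns 1 if holes were skipped with
// lseek(), 0 if the destination couldn't be truncated and holes were
// written out as zeroes instead, -1 on error.
int copy_sparse(int in_fd, int out_fd)
{
  struct stat st;
  off_t pos = 0, data, hole;
  char buf[65536];
  int sparse_ok;

  if (fstat(in_fd, &st)) return -1;
  // Drop any old contents so skipped ranges read back as zeroes.
  sparse_ok = !ftruncate(out_fd, 0);
  if (sparse_ok && lseek(out_fd, 0, SEEK_SET) == -1) return -1;

  while (pos < st.st_size) {
    // Find the next data segment; ENXIO means only a trailing hole is left.
    if ((data = lseek(in_fd, pos, SEEK_DATA)) == -1) {
      if (errno != ENXIO) return -1;
      data = st.st_size;
    }
    if (data > pos) {
      if (!sparse_ok) {
        if (fill_zeroes(out_fd, data - pos)) return -1;
      } else if (lseek(out_fd, data, SEEK_SET) == -1) return -1;
    }
    if (data >= st.st_size) break;
    if ((hole = lseek(in_fd, data, SEEK_HOLE)) == -1) return -1;

    // Copy the data segment [data, hole).
    for (pos = data; pos < hole;) {
      off_t left = hole - pos;
      size_t want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
      ssize_t got = pread(in_fd, buf, want, pos);

      if (got <= 0 || write_all(out_fd, buf, got)) return -1;
      pos += got;
    }
  }
  // Set the final length so a trailing hole survives.
  if (sparse_ok && ftruncate(out_fd, st.st_size)) return -1;

  return sparse_ok;
}
```

The 0 return is where the caller would decide about the warning: say
something only if --sparse=auto or --sparse=always was given explicitly,
stay quiet if it was just the default.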

> Rob