[Toybox] cp --sparse=
Rob Landley
rob at landley.net
Fri Feb 6 10:09:03 PST 2026
On 2/4/26 17:21, enh wrote:
> we had some recent issues where folks were copying large sparse files
> not realizing that cp does not preserve sparseness.
Nobody had asked yet. :)
> i was looking at
> implementing --sparse= in toybox cp, but i was a bit confused that i
> couldn't actually make coreutils' "auto" heuristic fire.
I am often confused by coreutils.
> if you use
> --sparse=auto it does the lseek() you'd expect (like the existing
> toybox tar.c code), and --sparse=never is obviously the current
> behavior, but i'm not sure about "auto" based on observed
> behavior/strace.
You just said if you do auto it does the lseek like you expect, but
you're not sure about auto. I'm confused.
I thought "always" sparsified a previously not-sparse file just based on
runs of zeroes in the source, and just lseeking past them in dest
(possibly with an aligned 512 byte minimum size for bothering) and
letting the kernel figure it out from there.
> i also don't really understand why coreutils is trying to be clever
> here? why not just do the lseek()?
I'm happy to just do the lseek for auto, which seems to be the default.
To be honest, if you really want to desparsify you can tar c | tar x or
for one file there's "cat > newfile". I've always been a little unclear
why this level of filesystem shenanigans is quite so user visible, it
seems like fallocate() level shenanigans? In fact punching a hole AFTER
the fact is fallocate(FALLOC_FL_PUNCH_HOLE) which I was asking about for
YEARS before they finally implemented it... and then it required
filesystem support even though creating holes with lseek doesn't...?
There's a "hardlink" tool in util-linux these days (with rather a lot of
command line options), and one of the notes I have on it is whether it
should automatically make files sparse. (Or whether some other tool
should...)
> i also kind of assume this would make your refactoring finger itch and
> want to share the sendfile_sparse() code from tar.c rather than have a
> duplicate implementation in cp.c
The problem is that's not copying from fd to fd, it's copying from data
structure to fd.
> or changing the lib xsendfile() stuff
> to take a "sparse mode" argument.
Eh, maybe. The point of xsendfile() is to wrap the kernel's
copy_file_range() stuff for speed, and you'd THINK that would
automatically do sparseness if the original was sparse but apparently
not. (And that's before "what does macos do"...)
Would there be any other users of a sparse copy function, given that tar
isn't doing fd->fd copying but always has an archive format at one end?
> and i _also_ haven't thought hard
> enough about why tar.c's sparse file handling is a two-pass algorithm,
> and whether that's meaningful for cp too.
Not sure what you mean: sendfile_sparse() is used by the extraction
code, it loops over an input array and does the thing, I think in one go?
The archive creation side has the SEEK_HOLE stuff inline in add_to_tar()
circa line 444 (comment "enumerate the extents") but the point there is
we have to write out potentially multiple S records before the data
blocks, so we need all the metadata for those header blocks before we
start sending file data (line 502).
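
As a rough illustration of what "enumerate the extents" involves (a sketch, not the actual add_to_tar() code), the data runs of a file can be collected up front with lseek() SEEK_DATA and SEEK_HOLE before any file data gets sent. The struct extent type and list_data_extents() name are hypothetical, invented for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical type for this example: one run of data in a file.
struct extent { off_t start, len; };

// Fill ext[] with up to max data extents of fd.
// Returns the number of extents found, or -1 on error.
static int list_data_extents(int fd, struct extent *ext, int max)
{
  off_t pos = 0, hole;
  int count = 0;

  while (count < max) {
    // Find the next data at or after pos. ENXIO means no more data.
    pos = lseek(fd, pos, SEEK_DATA);
    if (pos == -1) return (errno == ENXIO) ? count : -1;

    // Find where this data run ends. End of file counts as a hole.
    hole = lseek(fd, pos, SEEK_HOLE);
    if (hole == -1) return -1;

    ext[count].start = pos;
    ext[count++].len = hole - pos;
    pos = hole;
  }

  return count;
}
```

A filesystem that does not track holes reports the whole file as a single data extent, so a caller gets a usable (if not sparse) answer there too.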
The thing about tar is it's basically two different commands in one. The
archive creation and the archive extraction sides share very little, and
a feature generally needs to be implemented in both to be useful.
> (but i've been idly thinking about this for weeks without making any
> forward progress, so it's time to just send a braindump :-) )
Would you like cp to auto-sparse files, or only with -a, or...?
Since this is already using sendfile, the likely thing to do is make a
sparse_sendfile() that calls sendfile_len() so we get the in-kernel copy
of the segments. Except the OTHER fun thing here is we may be copying
OVER existing files (I don't remember if tar always just deletes and
recreates, or I just decided not to care there). But "cp file.img
/dev/fda" was definitely a thing back in the day.
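
A minimal sketch of what such an fd to fd sparse copy could look like, assuming a new, empty regular file as the destination (so none of the overwrite questions apply). It calls copy_file_range() directly rather than toybox's sendfile_len(), and falls back to pread()/pwrite() when the in-kernel copy is unavailable. The names copy_range() and sparse_copy() are hypothetical, and a short pwrite() is simply treated as an error here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Copy len bytes at the same offset in both files, in-kernel when possible.
static int copy_range(int in, int out, off_t off, off_t len)
{
  char buf[65536];

  while (len > 0) {
    off_t oin = off, oout = off;
    ssize_t n = copy_file_range(in, &oin, out, &oout, len, 0);

    if (n <= 0) {
      // Fallback (unsupported filesystem, old kernel, etc): pread/pwrite.
      size_t want = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);

      n = pread(in, buf, want, off);
      if (n <= 0) return -1;
      if (pwrite(out, buf, n, off) != n) return -1;
    }
    off += n;
    len -= n;
  }

  return 0;
}

// Hypothetical sparse copy: data extents are copied, holes are skipped.
// Assumes out is a new, empty regular file. Returns 0 or -1.
static int sparse_copy(int in, int out)
{
  struct stat st;
  off_t pos = 0, hole;

  if (fstat(in, &st)) return -1;
  while (pos < st.st_size) {
    pos = lseek(in, pos, SEEK_DATA);
    if (pos == -1) {
      if (errno == ENXIO) break;  // only a hole remains before end of file
      return -1;
    }
    hole = lseek(in, pos, SEEK_HOLE);
    if (hole == -1) return -1;
    if (copy_range(in, out, pos, hole - pos)) return -1;
    pos = hole;
  }

  // Set the final size so a trailing hole is preserved.
  return ftruncate(out, st.st_size);
}
```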
So the NEXT question is what do you do if the old and new files don't
agree about sparseness? You can:
A) seek past anyway and leave old data in place if there was any,
B) fallocate(PUNCH_HOLE) the holes, which apparently can fail,
C) write zeroes to blank any not previously sparse dest data, but the
result isn't sparse in those places
D) delete dest file if source is sparse and recreate it with lseek
E) think harder about what to do
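
For illustration only, options B and C can be combined: try to punch the hole, and if the filesystem refuses, write zeroes over the range instead. This is a sketch of one possible policy, not a decision about what cp should do. The name blank_range() is hypothetical, and it assumes the range lies within the existing destination file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: make [off, off+len) of fd read back as zeroes.
// Tries option B (punch a hole), falls back to option C (write zeroes).
// Assumes the range lies within the current file size.
// Returns 1 if a hole was punched, 0 if zero-filled, -1 on error.
static int blank_range(int fd, off_t off, off_t len)
{
  static const char zeroes[4096];

  // KEEP_SIZE is required alongside PUNCH_HOLE; the file length is unchanged.
  if (!fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, off, len))
    return 1;

  // Punching failed (for example no filesystem support): overwrite with
  // zeroes, which blanks the old data but leaves the range allocated.
  while (len > 0) {
    size_t want = len < (off_t)sizeof(zeroes) ? (size_t)len : sizeof(zeroes);
    ssize_t n = pwrite(fd, zeroes, want, off);

    if (n <= 0) return -1;
    off += n;
    len -= n;
  }

  return 0;
}
```

The return value lets a caller tell whether the destination actually ended up sparse in that range, which matters if the user asked for sparseness explicitly.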
Rob