[Toybox] cp --sparse=

Fri Feb 6 10:09:03 PST 2026

On 2/4/26 17:21, enh wrote:
> we had some recent issues where folks were copying large sparse files
> not realizing that cp does not preserve sparseness.

Nobody had asked yet. :)

> i was looking at
> implementing --sparse= in toybox cp, but i was a bit confused that i
> couldn't actually make coreutils' "auto" heuristic fire.

I am often confused by coreutils.

> if you use
> --sparse=auto it does the lseek() you'd expect (like the existing
> toybox tar.c code), and --sparse=never is obviously the current
> behavior, but i'm not sure about "auto" based on observed
> behavior/strace.

You just said if you do auto it does the lseek like you expect, but 
you're not sure about auto. I'm confused.

I thought "always" sparsified a previously not-sparse file just based on 
runs of zeroes in the source, and just lseeking past them in dest 
(possibly with an aligned 512 byte minimum size for bothering) and 
letting the kernel figure it out from there.

> i also don't really understand why coreutils is trying to be clever
> here? why not just do the lseek()?

I'm happy to just do the lseek for auto, which seems to be the default.

To be honest, if you really want to desparsify you can tar c | tar x or 
for one file there's "cat > newfile". I've always been a little unclear 
why this level of filesystem shenanigans is quite so user visible, it 
seems like fallocate() level shenanigans? In fact punching a hole AFTER 
the fact is fallocate(FALLOC_FL_PUNCH_HOLE) which I was asking about for 
YEARS before they finally implemented it... and then it required 
filesystem support even though creating holes with lseek doesn't...?

There's a "hardlink" tool in util-linux these days (with rather a lot of 
command line options), and one of the notes I have on it is whether it 
should automatically make files sparse. (Or whether some other tool 
should...)

> i also kind of assume this would make your refactoring finger itch and
> want to share the sendfile_sparse() code from tar.c rather than have a
> duplicate implementation in cp.c

The problem is that's not copying from fd to fd, it's copying from data 
structure to fd.

> or changing the lib xsendfile() stuff
> to take a "sparse mode" argument.

Eh, maybe. The point of xsendfile() is to wrap the kernel's 
copy_file_range() stuff for speed, and you'd THINK that would 
automatically do sparseness if the original was sparse but apparently 
not. (And that's before "what does macos do"...)

Would there be any other users of a sparse copy function, given that tar 
isn't doing fd->fd copying but always has an archive format at one end?

> and i _also_ haven't thought hard
> enough about why tar.c's sparse file handling is a two-pass algorithm,
> and whether that's meaningful for cp too.

Not sure what you mean: sendfile_sparse() is used by the extraction 
code, it loops over an input array and does the thing, I think in one go?

The archive creation side has the SEEK_HOLE stuff inline in add_to_tar() 
circa line 444 (comment "enumerate the extents") but the point there is 
we have to write out potentially multiple S records before the data 
blocks, so we need all the metadata for those header blocks before we 
start sending file data (line 502).

The thing about tar is it's basically two different commands in one. The 
archive creation and the archive extraction sides share very little, and 
a feature generally needs to be implemented in both to be useful.

> (but i've been idly thinking about this for weeks without making any
> forward progress, so it's time to just send a braindump :-) )

Would you like cp to auto-sparse files, or only with -a, or...?

Since this is already using sendfile, the likely thing to do is make a 
sparse_sendfile() that calls sendfile_len() so we get the in-kernel copy 
of the segments. Except the OTHER fun thing here is we may be copying 
OVER existing files (I don't remember if tar always just deletes and 
recreates, or I just decided not to care there). But "cp file.img 
/dev/fda" was definitely a thing back in the day.
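
A rough sketch of what such a function might look like, using SEEK_DATA/SEEK_HOLE to find the segments. This is not toybox code: copy_holes() and copy_segment() are hypothetical names, copy_file_range() stands in for whatever sendfile_len() does internally, and it assumes the destination is a new empty file (which dodges the overwrite question entirely):

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Copy len bytes at the same offset in both files. Tries the in-kernel
// copy first, falls back to pread()/pwrite() if that doesn't work.
static int copy_segment(int in, int out, off_t pos, off_t len)
{
  char buf[65536];

  while (len > 0) {
    off_t ipos = pos, opos = pos;
    ssize_t n = copy_file_range(in, &ipos, out, &opos, len, 0);

    if (n <= 0) {
      size_t want = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);

      if ((n = pread(in, buf, want, pos)) <= 0) return -1;
      // A short write is treated as an error to keep the sketch small.
      if (pwrite(out, buf, n, pos) != n) return -1;
    }
    pos += n;
    len -= n;
  }

  return 0;
}

// Copy in to out, recreating the source's holes. Assumes out is a new,
// empty file. Returns 0 on success, -1 on error.
int copy_holes(int in, int out)
{
  struct stat st;
  off_t data = 0, hole;

  if (fstat(in, &st)) return -1;
  while (data < st.st_size) {
    // Find the next data segment; ENXIO means only a hole remains.
    if ((data = lseek(in, data, SEEK_DATA)) < 0) {
      if (errno == ENXIO) break;
      return -1;
    }
    if ((hole = lseek(in, data, SEEK_HOLE)) < 0) return -1;
    if (copy_segment(in, out, data, hole-data)) return -1;
    data = hole;
  }

  // Set the final size so a trailing hole is preserved.
  return ftruncate(out, st.st_size);
}
```

On a filesystem that doesn't track holes, SEEK_DATA/SEEK_HOLE report the whole file as one data segment, so this degrades to a plain copy.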

So the NEXT question is what do you do if the old and new files don't 
agree about sparseness? You can:

A) seek past anyway and leave old data in place if there was any,
B) fallocate(PUNCH_HOLE) the holes, which apparently can fail,
C) write zeroes to blank any not previously sparse dest data, but the 
result isn't sparse in those places
D) delete dest file if source is sparse and recreate it with lseek
E) think harder about what to do
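
For what it's worth, B and C combine naturally: try to punch the hole, and write zeroes if the filesystem refuses. A sketch under the assumption that the range lies within the existing file (the write path would otherwise extend it, the punch path would not). blank_range() is a made-up name, not an existing toybox function:

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Make [off, off+len) of an existing file read back as zeroes.
// Returns 1 if a hole was punched, 0 if zeroes were written, -1 on error.
int blank_range(int fd, off_t off, off_t len)
{
  char zero[4096];

  // Option B: deallocate the range. Can fail, for example with EOPNOTSUPP.
  if (!fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, off, len))
    return 1;

  // Option C: overwrite with zeroes. Correct contents, but not sparse.
  memset(zero, 0, sizeof(zero));
  while (len > 0) {
    size_t want = len < (off_t)sizeof(zero) ? (size_t)len : sizeof(zero);
    ssize_t n = pwrite(fd, zero, want, off);

    if (n <= 0) return -1;
    off += n;
    len -= n;
  }

  return 0;
}
```

The return value lets the caller tell which path was taken, in case cp wants to warn that the result isn't sparse.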

Rob