[Toybox] cp --sparse=

enh enh at google.com
Fri Feb 13 13:10:50 PST 2026


On Fri, Feb 6, 2026 at 1:12 PM Rob Landley <rob at landley.net> wrote:
>
> On 2/4/26 17:21, enh wrote:
> > we had some recent issues where folks were copying large sparse files
> > not realizing that cp does not preserve sparseness.
>
> Nobody had asked yet. :)
>
> > i was looking at
> > implementing --sparse= in toybox cp, but i was a bit confused that i
> > couldn't actually make coreutils' "auto" heuristic fire.
>
> I am often confused by coreutils.
>
> > if you use
> > --sparse=auto it does the lseek() you'd expect (like the existing
> > toybox tar.c code), and --sparse=never is obviously the current
> > behavior, but i'm not sure about "auto" based on observed
> > behavior/strace.
>
> You just said if you do auto it does the lseek like you expect, but
> you're not sure about auto. I'm confused.

yeah, sorry --- "auto" does _not_ do the lseek(). "always" does.

> I thought "always" sparsified a previously not-sparse file just based on
> runs of zeroes in the source, and just lseeking past them in dest
> (possibly with an aligned 512 byte minimum size for bothering) and
> letting the kernel figure it out from there.

huh ... maybe the other way around? that would explain why "auto" does
_not_ do the lseek() to look for gaps, if it's just looking for blocks
of zeroes. (though tbh, if asked to guess what the behavior would be,
i'd have assumed the opposite!)
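
(for concreteness, here's a rough sketch of the "look for blocks of zeroes and lseek() past them in dest" idea. copy_making_holes() and BLK are made-up names, the 512 byte granularity is just the guess from the quoted text, and it assumes the destination is a fresh empty file. it's not what coreutils or toybox actually does.)

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 512  // assumed granularity (the "aligned 512 byte minimum" guess)

// Return 1 if the len bytes at p are all zero (len must be nonzero).
static int all_zero(const char *p, size_t len)
{
  return !*p && !memcmp(p, p + 1, len - 1);
}

// Copy everything from fd "in" to fd "out", seeking over full all-zero
// blocks instead of writing them. Returns 0 on success, -1 on error.
int copy_making_holes(int in, int out)
{
  char *buf = malloc(BLK);
  ssize_t n = -1;
  off_t pos;

  assert(in >= 0 && out >= 0);
  if (!buf) return -1;
  while ((n = read(in, buf, BLK)) > 0) {
    if (n == BLK && all_zero(buf, n)) {
      // don't write the zeroes, just move the output position past them
      if (lseek(out, n, SEEK_CUR) < 0) break;
    } else {
      ssize_t done = 0, w;

      // loop to cope with short writes
      while (done < n && (w = write(out, buf + done, n - done)) > 0) done += w;
      if (done < n) break;
    }
  }
  free(buf);
  if (n) return -1;  // read error, or we bailed out of the loop early

  // a trailing hole only exists once the file length covers it
  pos = lseek(out, 0, SEEK_CUR);
  return (pos < 0 || ftruncate(out, pos)) ? -1 : 0;
}
```

note that this never asks the filesystem where the source's holes are: it reads every byte and scans it, which is the cpu cost being weighed against a single lseek().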

> > i also don't really understand why coreutils is trying to be clever
> > here? why not just do the lseek()?
>
> I'm happy to just do the lseek for auto, which seems to be the default.
>
> To be honest, if you really want to desparsify you can tar c | tar x or
> for one file there's "cat > newfile". I've always been a little unclear
> why this level of filesystem shenanigans is quite so user visible, it
> seems like fallocate() level shenanigans? In fact punching a hole AFTER
> the fact is fallocate(FALLOC_FL_PUNCH_HOLE) which I was asking about for
> YEARS before they finally implemented it... and then it required
> filesystem support even though creating holes with fseek doesn't...?

yeah, given that the "is there at least one hole?" test is just a
single lseek(), it's odd to me that it's not the coreutils default. i
feel like "preserve holes" and "make holes wherever possible" are two
unrelated behaviors.
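
(a minimal sketch of that "is there at least one hole?" test, assuming Linux-style SEEK_HOLE and a freshly opened fd that's fine to leave at offset 0. has_hole() is a hypothetical helper, not an existing toybox function.)

```c
#define _GNU_SOURCE  // glibc only exposes SEEK_HOLE with this
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_HOLE
#define SEEK_HOLE 4  // Linux value, assumed here in case the headers hide it
#endif

// Return 1 if fd has a hole before EOF, 0 if not, -1 on error.
// Leaves the file offset at 0.
int has_hole(int fd)
{
  struct stat st;
  off_t hole;

  assert(fd >= 0);
  if (fstat(fd, &st)) return -1;
  if (!st.st_size) return 0;  // SEEK_HOLE at EOF would fail with ENXIO

  // every file has an implicit hole at EOF, so "first hole == size"
  // means no real holes (or a filesystem that can't report them)
  hole = lseek(fd, 0, SEEK_HOLE);
  if (hole < 0 || lseek(fd, 0, SEEK_SET) < 0) return -1;

  return hole < st.st_size;
}
```

on a filesystem that doesn't track holes the whole file looks like data, so this returns 0 and a caller would just fall through to the ordinary copy.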

> There's a "hardlink" tool in util-linux these days (with rather a lot of
> command line options), and one of the notes I have on it is whether it
> should automatically make files sparse. (Or whether some other tool
> should...)
>
> > i also kind of assume this would make your refactoring finger itch and
> > want to share the sendfile_sparse() code from tar.c rather than have a
> > duplicate implementation in cp.c
>
> The problem is that's not copying from fd to fd, it's copying from data
> structure to fd.
>
> > or changing the lib xsendfile() stuff
> > to take a "sparse mode" argument.
>
> Eh, maybe. The point of xsendfile() is to wrap the kernel's
> copy_file_range() stuff for speed, and you'd THINK that would
> automatically do sparseness if the original was sparse but apparently
> not. (And that's before "what does macos do"...)

/me checks man page

       If fd_in is a sparse file, then copy_file_range() may expand any
       holes existing in the requested range.  Users may benefit from
       calling copy_file_range() in a loop, and using the lseek(2)
       SEEK_DATA and SEEK_HOLE operations to find the locations of data
       segments.

d'oh!
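
(roughly what the man page is suggesting, as a sketch: walk the data segments with SEEK_DATA/SEEK_HOLE and only copy those. copy_preserving_holes() and copy_range() are hypothetical names. copy_range() is a plain pread()/pwrite() stand-in for wherever copy_file_range() or the xsendfile() machinery would go, and the destination is assumed to be a fresh empty file.)

```c
#define _GNU_SOURCE  // glibc only exposes SEEK_DATA/SEEK_HOLE with this
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3  // Linux values, assumed here in case the headers hide them
#define SEEK_HOLE 4
#endif

// Copy len bytes at offset off from in to the same offset in out.
static int copy_range(int in, int out, off_t off, off_t len)
{
  char buf[65536];

  while (len > 0) {
    size_t want = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
    ssize_t n = pread(in, buf, want, off), done = 0, w;

    if (n <= 0) return -1;  // error, or the file shrank under us
    while (done < n) {
      if ((w = pwrite(out, buf + done, n - done, off + done)) <= 0) return -1;
      done += w;
    }
    off += n;
    len -= n;
  }
  return 0;
}

// Copy in to out, skipping the source's holes. Returns 0 or -1.
int copy_preserving_holes(int in, int out)
{
  off_t end = lseek(in, 0, SEEK_END), data = 0, hole;

  if (end < 0) return -1;
  while (data < end) {
    data = lseek(in, data, SEEK_DATA);
    if (data < 0) {
      if (errno == ENXIO) break;  // nothing but hole from here to EOF
      return -1;
    }
    hole = lseek(in, data, SEEK_HOLE);
    if (hole < 0) return -1;
    assert(hole > data);  // a data segment is never empty
    if (copy_range(in, out, data, hole - data)) return -1;
    data = hole;
  }

  // set the final length so a trailing hole is preserved too
  return ftruncate(out, end);
}
```

unlike scanning for zeroes, the cost here scales with the number of segments rather than the size of the file, and a file with no holes is just one segment.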

> Would there be any other users of a sparse copy function, given that tar
> isn't doing fd->fd copying but always has an archive format at one end?

(given that you've reused the cp code for stuff like install, you're
probably right that this is the only likely user.)

> > and i _also_ haven't thought hard
> > enough about why tar.c's sparse file handling is a two-pass algorithm,
> > and whether that's meaningful for cp too.
>
> Not sure what you mean: sendfile_sparse() is used by the extraction
> code, it loops over an input array and does the thing, I think in one go?

it was the "there's a TT.sparse array passed in" part that i meant.

> The archive creation side has the SEEK_HOLE stuff inline in add_to_tar()
> circa line 444 (comment "enumerate the extents") but the point there is
> we have to write out potentially multiple S records before the data
> blocks, so we need all the metadata for those header blocks before we
> start sending file data (line 502).
>
> The thing about tar is it's basically two different commands in one. The
> archive creation and the archive extraction sides share very little, and
> a feature generally needs to be implemented in both to be useful.
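
(to illustrate the "need all the metadata before sending file data" shape: a sketch that enumerates the extents into an array first, so a second pass can consume them. struct extent and list_extents() are made up for illustration and are not the actual TT.sparse layout or the add_to_tar() code.)

```c
#define _GNU_SOURCE  // glibc only exposes SEEK_DATA/SEEK_HOLE with this
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3  // Linux values, assumed here in case the headers hide them
#define SEEK_HOLE 4
#endif

struct extent {
  off_t start, len;
};

// First pass: record up to max data extents of fd in ext[].
// Returns the number recorded (silently capped at max), or -1 on error.
int list_extents(int fd, struct extent *ext, int max)
{
  off_t end = lseek(fd, 0, SEEK_END), pos = 0, hole;
  int n = 0;

  assert(ext && max > 0);
  if (end < 0) return -1;
  while (pos < end && n < max) {
    pos = lseek(fd, pos, SEEK_DATA);
    if (pos < 0) {
      if (errno == ENXIO) break;  // only hole left before EOF
      return -1;
    }
    hole = lseek(fd, pos, SEEK_HOLE);
    if (hole < 0) return -1;
    ext[n].start = pos;
    ext[n].len = hole - pos;
    n++;
    pos = hole;
  }

  return n;
}
```

an archiver needs the whole list up front to write its headers. a plain fd to fd copy could act on each extent as it finds it, so the array is probably optional for cp.
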
>
> > (but i've been idly thinking about this for weeks without making any
> > forward progress, so it's time to just send a braindump :-) )
>
> Would you like cp to auto-sparse files, or only with -a, or...?

i hadn't really thought about the auto-sparsing side. but aiui you're
saying coreutils _does_ try to do that, it's the preservation that it
doesn't do? that seems like it's doing the _more_ expensive thing? (or
they have some weird assumption that the cpu time for scanning each
block is cheap but one failed lseek() is expensive?)

(fwiw, it was the "lack of hole preservation by default" that caught
people out. i don't think they had any expectations about adding holes
that didn't already exist.)

> Since this is already using sendfile, the likely thing to do is make a
> sparse_sendfile() that calls sendfile_len() so we get the in-kernel copy
> of the segments. Except the OTHER fun thing here is we may be copying
> OVER existing files (I don't remember if tar always just deletes and
> recreates, or I just decided not to care there). But "cp file.img
> /dev/fda" was definitely a thing back in the day.
>
> So the NEXT question is what do you do if the old and new files don't
> agree about sparseness? You can:
>
> A) seek past anyway and leave old data in place if there was any,
> B) fallocate(PUNCH_HOLE) the holes, which apparently can fail
> C) write zeroes to blank any not previously sparse dest data, but the
> result isn't sparse in those places
> D) delete dest file if source is sparse and recreate it with lseek
> E) think harder about what to do

huh ... i also hadn't thought of that. are there existing cases where
toybox (or coreutils) look at the _destination_ file rather than just
assuming the source is canonical?

(even if you're ignoring that, your case B is still relevant: "is it a
failure to copy if you copy a sparse file but can't make the copy
sparse?".)
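
(one possible answer, sketched: try B and quietly fall back to C, so the data is always right and sparseness is best effort. punch_or_zero() is a hypothetical helper, this is Linux-specific, and it assumes the range lies inside the existing file. it's one option among the above, not a recommendation.)

```c
#define _GNU_SOURCE  // glibc needs this for fallocate() and FALLOC_FL_*
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Make bytes [off, off+len) of fd read back as zeroes.
// Returns 1 if a hole was punched, 0 if zeroes were written instead,
// -1 on error.
int punch_or_zero(int fd, off_t off, off_t len)
{
  char zeroes[4096];

  assert(off >= 0 && len > 0);

  // option B: ask the filesystem to deallocate the range
  if (!fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len))
    return 1;

  // option C: the data ends up correct, but the range isn't sparse
  memset(zeroes, 0, sizeof(zeroes));
  while (len > 0) {
    size_t want = len < (off_t)sizeof(zeroes) ? (size_t)len : sizeof(zeroes);
    ssize_t w = pwrite(fd, zeroes, want, off);

    if (w <= 0) return -1;
    off += w;
    len -= w;
  }

  return 0;
}
```

the return value lets a caller decide whether "couldn't make it sparse" deserves a warning, an error, or nothing at all.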

> Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net

