[Toybox] [PATCH] cpio: support reading concatenated cpio files.

Rob Landley rob at landley.net
Thu Apr 15 00:04:24 PDT 2021


On 4/14/21 1:26 PM, enh wrote:
>     Could you read the linux doc thing and confirm that the behavior you want is
>     still to stop at TRAILER instead of flushing hardlink context but otherwise
>     continuing to extract like the kernel guys documented for initramfs? (Or am I
>     misremembering? It's been a while...)
> 
> in the thread you linked to, they say "I wonder how existing GNU or BSD cpio ...
> would deal with reading such a file". all i'm saying is "GNU cpio exits on the
> next record boundary, and people have scripts that rely on this".

My question was really "would continuing until you run out of cpio records, and
exiting with an error if there were no cpio records, also satisfy those scripts"?

> the Linux docs say things like
> 
>   The cpio "TRAILER!!!" entry (cpio end-of-archive) is optional, but is
>   not ignored; see "handling of hard links" below.
> 
> but that doesn't match what actual implementations of cpio do. (assuming you
> don't interpret optional as meaning "you don't have to have one, but if you
> don't, the tool will exit with an error complaining that you don't have one" :-) )

A year or three back I had it not adding the TRAILER!!! entry, then added a
--trailer option, and you submitted a commit removing that option so it always
adds TRAILER now.

But not having one isn't unprecedented...

> i think the most interesting thing for me in the docs was:
> 
>   When a "TRAILER!!!" end-of-archive marker is seen, the tuple buffer is
>   reset.  This permits archives which are generated independently to be
>   concatenated.
>  
> because -- even if i haven't really understood _why_ people are concatenating
> cpio files -- at least this shows that the main consumers/producers agree that
> this is an expected use case.

They're incrementally generating filesystems, using a base cpio and then adding
more entries.

If your base has /dev/console style nodes in it with special ownership and
permissions which you can't create locally as a normal user, you have to use an
awkward tool like gen_init_cpio from the kernel source to generate
synthetic cpio entries. But you then often want to append a directory full of
files that live in your local filesystem using normal "find | cpio".

It's also a poor man's form of initramfs package management: select this and
this and this without extracting them all into a temporary directory and then
packaging up the directory (and potentially having
permissions/ownership/timestamps change). This trick can even drop start files
next to each other in etc/rc for sysvinit to pick up and run on boot.
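
For illustration, here's a minimal Python sketch of that concatenation trick.
It is not toybox or kernel code: the helper names (pad4, newc_entry,
newc_archive) and the example file names are made up, and it assumes the "newc"
format (magic 070701, thirteen 8-hex-digit fields, name and data each padded to
4 bytes). It builds a "base" archive with a synthetic root-owned device node and
an independently generated second archive, then glues them together:

```python
def pad4(blob):
    """Pad with NUL bytes up to the next multiple of 4."""
    return blob + b"\0" * (-len(blob) % 4)

def newc_entry(name, data=b"", mode=0o100644, ino=0, nlink=1, rdev=(0, 0)):
    """Return one newc record: 110-byte header, NUL-terminated name, data."""
    name_bytes = name.encode() + b"\0"
    # Field order: ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check.
    fields = [ino, mode, 0, 0, nlink, 0, len(data),
              0, 0, rdev[0], rdev[1], len(name_bytes), 0]
    header = b"070701" + b"".join(b"%08X" % value for value in fields)
    return pad4(header + name_bytes) + pad4(data)

def newc_archive(entries):
    """Join records and terminate them with the TRAILER!!! entry."""
    return b"".join(entries) + newc_entry("TRAILER!!!", mode=0)

# Base archive: a character device (5,1) owned by root, no local mknod needed.
base = newc_archive(
    [newc_entry("dev/console", mode=0o020600, ino=1, rdev=(5, 1))])
# Second archive, generated independently, so its inode numbers may collide.
extra = newc_archive(
    [newc_entry("etc/rc/start", b"#!/bin/sh\n", mode=0o100755, ino=1)])
initramfs = base + extra
```

Both halves reuse inode 1, which is why the TRAILER!!! in the middle matters:
it tells a reader that tracks hard links to forget what it saw in the first
half.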

> i'm assuming the "exit when you see TRAILER!!! and let the next cpio instance
> worry about the rest" behavior is just the least-effort implementation of the
> hard-link flush stuff:
> 
>   To combine file data from different sources (without having to
>   regenerate the (c_maj,c_min,c_ino) fields), therefore, either one of
>   the following techniques can be used:
> 
>   a) Separate the different file data sources with a "TRAILER!!!"
>      end-of-archive marker, or
> 
> exiting when you see TRAILER!!! implicitly loses any cpio state, and reporting
> an error if you hit EOF without seeing TRAILER!!! lets you know when to stop
> running a new cpio?

Least-effort implementation of flush is what I'm assuming too. I prefer to put
in more effort and do it right.

Extracting the whole archive seems like the correct behavior because it's what
the kernel initramfs plumbing does, and given that posix yanked cpio back in
susv2 and Jorg "Solaris Solaris Uber Alles" Schilling got outright indignant
when I suggested putting it back because it's actually _used_, that means the
only modern spec we have is the kernel spec (that I am aware of).
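
A rough Python sketch of what "flush but continue" could look like, under the
same newc assumptions as the kernel doc quoted above. This is neither the
kernel's nor toybox's implementation: iter_records is a made-up name, and hard
link handling is simplified to remembering the first name seen for each
(maj, min, ino) tuple.

```python
def iter_records(blob):
    """Yield (archive_index, name, data, link_to) for every record in blob,
    continuing past TRAILER!!! instead of stopping there."""
    pos, archive, seen = 0, 0, 0
    links = {}                      # (maj, min, ino) -> first name seen
    while pos < len(blob):
        if blob[pos] == 0:          # NUL padding between archives
            pos += 1
            continue
        header = blob[pos:pos + 110]
        if len(header) < 110 or header[:6] != b"070701":
            raise ValueError("bad cpio header at offset %d" % pos)
        seen += 1
        f = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        ino, nlink, size, maj, mino, namesize = (
            f[0], f[4], f[6], f[7], f[8], f[11])
        name = blob[pos + 110:pos + 110 + namesize - 1].decode()
        # Header+name and data are each padded to a multiple of 4 bytes.
        data_start = pos + ((110 + namesize + 3) & ~3)
        data = blob[data_start:data_start + size]
        pos = data_start + ((size + 3) & ~3)
        if name == "TRAILER!!!":
            links.clear()           # flush hard link state, keep going
            archive += 1
            continue
        link_to = None
        if nlink > 1:
            link_to = links.get((maj, mino, ino))
            links.setdefault((maj, mino, ino), name)
        yield archive, name, data, link_to
    if seen == 0:
        raise ValueError("no cpio records in input")
```

With this shape, a record in the second archive that reuses an inode number
from the first is not mistaken for a hard link, and input with no records at
all is reported as an error rather than as success.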

> (i think the doc is trying to distinguish between a cpio file [where TRAILER!!!
> marks the end] and an "initramfs buffer" which can contain multiple concatenated
> cpio files [and hence more than one TRAILER!!!]. so things processing initramfs
> buffers need to be cleverer than cpio when it comes to TRAILER!!!, but cpio
> doesn't. [and in practice, isn't.])

I agree gnu cpio doesn't, but that's because gnu.

> i think that answers your question, but perhaps in excessive detail, so i'll
> re-quote you and try again:

I don't mind excessive detail when I'm trying to figure out the correct course
of action for a design issue.

>> confirm that the behavior you want is
>> still to stop at TRAILER instead of flushing hardlink context but otherwise
>> continuing to extract
> 
> i agree that based on the Linux docs it would be more sensible to flush but
> continue, but that's demonstrably not what GNU cpio does, so it doesn't seem
> particularly helpful for us to do it. callers already have to have the bash
> while loop nonsense,

Some callers do, and I agree we can't _break_ them. But rendering the loop a NOP
doesn't break it.

> and implementing the better behavior in toybox would still
> be "broken" from that perspective because they'd loop forever --- toybox would
> at least have to consider the empty input as an error,

Empty input should be an error, yes. That would be consistent with tar, which
already errors out where cpio currently returns success:

  $ toybox cpio -i < /dev/null
  $ echo $?
  0
  $ toybox tar x < /dev/null
  tar: Not tar
  $ echo $?
  1

Whatever else we decide to do here, making empty input be an error sounds
correct to me. Of course the gnu/dammit version goes:

  $ cpio -i < /dev/null
  Found end of tape.  To continue, type device/file name when ready.

Which... no? Just no.

> at which point we haven't
> really reduced the ugliness much? (i'm also scared to suggest anything beyond
> "do what GNU does" because i don't personally know anything about cpio, and have
> never used it except to generate minimal repro cases for stuff that kernel folks
> bring up.

I have, sadly, had to learn rather a lot about it although I don't claim to be
an expert.

Still, if we're changing the behavior, eating all the input seems more correct,
and erroring on empty input seems like it would satisfy the loops people are
using to work around the limitations in the gnu/dammit implementation (a
limitation which is already not present in the kernel's implementation used by
initramfs).
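
To make that argument concrete, here's a toy Python model (not real cpio
parsing: records are just newline-terminated names, and the function names are
invented for this example). It compares a tool that stops at each trailer with
one that eats all the input, both treating empty input as an error, when driven
by the kind of "rerun while it succeeds" loop people wrap around cpio:

```python
import io

def stop_at_trailer(stream, out):
    """Toy model: extract one archive, leave the rest of the input unread."""
    seen = 0
    for line in iter(stream.readline, b""):
        seen += 1
        if line == b"TRAILER!!!\n":
            break
        out.append(line.rstrip().decode())
    return 0 if seen else 1         # empty input is an error

def eat_everything(stream, out):
    """Toy model: keep extracting past every trailer until end of input."""
    seen = 0
    for line in iter(stream.readline, b""):
        seen += 1
        if line != b"TRAILER!!!\n":
            out.append(line.rstrip().decode())
    return 0 if seen else 1         # empty input is an error

def workaround_loop(tool, payload):
    """Rerun the tool on the same stream for as long as it reports success."""
    stream, out, runs = io.BytesIO(payload), [], 0
    while tool(stream, out) == 0:
        runs += 1
    return out, runs

payload = b"a\nb\nTRAILER!!!\nc\nTRAILER!!!\n"
print(workaround_loop(stop_at_trailer, payload))
print(workaround_loop(eat_everything, payload))
```

In this model both variants extract the same names and both loops terminate.
The only difference is how many successful runs it takes, which is the sense in
which the existing loop becomes a NOP rather than broken.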

> i haven't looked at BSD, but they seem to interpret TRAILER!!! as end
> of archive too: https://www.freebsd.org/cgi/man.cgi?query=cpio&sektion=5 ... and

Have they changed the behavior of their tool in the past 25 years? (It's not
like 64 bit processors or large file support means much when your file format is
8 hex digits for all the metadata fields...)

> eighthly, carrying on past TRAILER!!! when no-one else does sounds like one of
> those security issues Android had back in the "zip master key" days; even if the

I didn't hear this story, but it sounds unpleasant.

> format is stupid, it's safer when everyone interprets the format the same way...
> who knows what crap people are accidentally/deliberately ignoring past a
> TRAILER!!! that isn't actually at the end [because they _don't_ have the bash
> while loop]? i'd prefer not to find out :-) )

Ok, valid point. But if they feed such an initramfs into the kernel it will
process all those records now, so the behavior _isn't_ currently consistent.

Who are the users of this you're seeing? (The other major user of this that I'm
aware of is the RPM package format. I don't know what they do, because I don't
know where the source to the rpm tools lives. I lost track circa
https://lwn.net/Articles/196523/ and moved to .deb based systems anyway...)

Hmmm. If you're really concerned about more capable default behavior being
nebulously unsafe in a way that I can't prove a negative (grumble grumble),
maybe it needs an --all option? The man page doesn't mention -A but of course:

  $ cpio -iA < /dev/null
  cpio: --append is meaningless with --extract
  $ cpio -ia < /dev/null
  cpio: --reset is meaningless with --extract

This is gnu we're talking about: they only actually document stuff in "info"
pages. Sigh. (The man page mentions --append but has no short option for it.
The only reset it mentions is --reset-access-time and it doesn't say what that
DOES...)

Needing a for loop around the tool seems broken to me. Not breaking people's
workarounds is important, but implementing behavior that WON'T break them while
rendering the workaround unnecessary seems easy enough?

People depending on a limitation of the tool for "security" is hard for me to
say anything coherent about. I _want_ it to be the default behavior, but if it
needs to be an option...

> hmm. my second attempt seems to have more words than my first. i'll stop here.

Another reason the for loop creeps me out is programs read more data than they
actually need from input ALL THE TIME. It's how ansi FILE * buffers work, and an
input pipe isn't seekable so you can't put the data _back_ if you find yourself
with extra and are about to exit.

This is an implicit dependence on an implementation detail, that you can
continue from where the previous program left off reading the same pipe without
having lost anything to buffers reading ahead. (Yes, I wrote my cpio with fd
rather than FILE for that reason, but DEPENDING on it? Ew.)
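
A small Python sketch of the readahead problem, using Python's buffered file
objects as a stand-in for FILE * (the function name is made up, and the exact
amount read ahead depends on the runtime's buffer size, so treat this as an
illustration rather than a guarantee). The first reader only asks for six
bytes from a pipe, then a second reader continues on the same descriptor:

```python
import os

def leftover_after_first_reader(buffered):
    """Return (what the first reader asked for, what is left for the next)."""
    r, w = os.pipe()
    os.write(w, b"first\nsecond\n")
    os.close(w)
    # buffering=-1 picks the default buffered reader, 0 means raw reads.
    first = os.fdopen(r, "rb", buffering=-1 if buffered else 0, closefd=False)
    head = first.read(6)            # the first tool only wants "first\n"
    rest = os.read(r, 100)          # the next tool reads the same descriptor
    first.close()
    os.close(r)
    return head, rest

print(leftover_after_first_reader(buffered=True))
print(leftover_after_first_reader(buffered=False))
```

With the buffered reader the second read typically comes back empty, because
"second\n" was already pulled into the first reader's buffer and a pipe can't
be rewound. With raw descriptor reads it is still there.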

> (i noticed as well that everyone seems to actually deal in _compressed_ cpio
> files, so in an ideal world i suspect cpio should be as intelligent as tar when
> it comes to such things --- but i think cpio'ing is too niche to warrant doing
> anything better than GNU.)

The linux kernel already does better than gnu. That's why they wrote their own
cpio create and extract plumbing. Create is in:

https://github.com/torvalds/linux/blob/master/usr/gen_init_cpio.c
https://github.com/torvalds/linux/blob/master/usr/gen_initramfs.sh

And extract is:

https://github.com/torvalds/linux/blob/master/init/initramfs.c#L256

Dunno what rpm is doing behind the scenes, but the kernel guys have talked about
xattr support (and sparse files, and 64 bit timestamps, and...) on more than one
occasion. Hence my todo list section on that, albeit in the probably post-1.0
"teach patch.c about the git file rename syntax" sort of way...

Linux cpio outgrowing gnu is probably inevitable. Richard Stallman is not
steering anything, hasn't been for decades. (He's sitting in a big chair making
vroom-vroom noises with his mouth, but the wheel and pedals aren't connected to
anything.)

Rob

P.S. sparse files are also a potential way to handle files > 4 gigs, by breaking
them into segments, but this is initramfs we're talking about so people
generally make pained noises when it comes up.

P.P.S. RPM also addressed large file support, but as usual is profoundly
unhelpful in saying exactly HOW ala https://rpm.org/devel_doc/large_files.html
because their business model is to obfuscate stuff until you pay them thousands
of dollars to be experts and not ask questions. I'm under the impression they
named this business model "enterprise" after the way the holodeck keeps
malfunctioning and trying to kill people. See also systemd.

