[Toybox] [PATCH] cpio: support reading concatenated cpio files.

enh enh at google.com
Thu Apr 15 10:06:44 PDT 2021


On Wed, Apr 14, 2021 at 11:49 PM Rob Landley <rob at landley.net> wrote:

> On 4/14/21 1:26 PM, enh wrote:
> >     Could you read the linux doc thing and confirm that the behavior you
> >     want is still to stop at TRAILER instead of flushing hardlink context
> >     but otherwise continuing to extract like the kernel guys documented
> >     for initramfs? (Or am I misremembering? It's been a while...)
> >
> > in the thread you linked to, they say "I wonder how existing GNU or BSD
> > cpio ... would deal with reading such a file". all i'm saying is "GNU
> > cpio exits on the next record boundary, and people have scripts that
> > rely on this".
>
> My question was really "would continuing until you run out of cpio
> records, and exiting with an error if there were no cpio records, also
> satisfy those scripts"?
>
> > the Linux docs say things like
> >
> >   The cpio "TRAILER!!!" entry (cpio end-of-archive) is optional, but is
> >   not ignored; see "handling of hard links" below.
> >
> > but that doesn't match what actual implementations of cpio do. (assuming
> > you don't interpret optional as meaning "you don't have to have one, but
> > if you don't, the tool will exit with an error complaining that you
> > don't have one" :-) )
>
> A year or three back I had it not adding the TRAILER!!! entry, then added
> a --trailer option, and you submitted a commit removing that option so it
> always adds TRAILER now.
>
> But not having one isn't unprecedented...
>
> > i think the most interesting thing for me in the docs was:
> >
> >   When a "TRAILER!!!" end-of-archive marker is seen, the tuple buffer is
> >   reset.  This permits archives which are generated independently to be
> >   concatenated.
> >
> > because -- even if i haven't really understood _why_ people are
> > concatenating cpio files -- at least this shows that the main
> > consumers/producers agree that this is an expected use case.
>
> They're incrementally generating filesystems, using a base cpio and then
> adding more entries.
>
> If your base has /dev/console style nodes in it with special ownership and
> permissions which you can't create locally as a normal user, you have to
> use an awkward tool like gen-initramfs-cpio from the kernel source to
> generate synthetic cpio entries. But you then often want to append a
> directory full of files that live in your local filesystem using normal
> "find | cpio".
>
> It's also a poor man's form of initramfs package management: select this
> and this and this without extracting them all into a temporary directory
> and then packaging up the directory (and potentially having
> permissions/ownership/timestamps change). This trick can even drop start
> files next to each other in etc/rc for sysvinit to pick up and run on
> boot.
>
> > i'm assuming the "exit when you see TRAILER!!! and let the next cpio
> > instance worry about the rest" behavior is just the least-effort
> > implementation of the hard-link flush stuff:
> >
> >   To combine file data from different sources (without having to
> >   regenerate the (c_maj,c_min,c_ino) fields), therefore, either one of
> >   the following techniques can be used:
> >
> >   a) Separate the different file data sources with a "TRAILER!!!"
> >      end-of-archive marker, or
> >
> > exiting when you see TRAILER!!! implicitly loses any cpio state, and
> > reporting an error if you hit EOF without seeing TRAILER!!! lets you
> > know when to stop running a new cpio?
>
> Least-effort implementation of flush is what I'm assuming too. I prefer to
> put in more effort and do it right.
>
> Extracting the whole archive seems like the correct behavior because it's
> what the kernel initramfs plumbing does, and given that posix yanked cpio
> back in susv2 and Jorg "Solaris Solaris Uber Alles" Schilling got outright
> indignant when I suggested putting it back because it's actually _used_,
> that means the only modern spec we have is the kernel spec (that I am
> aware of).
>
> > (i think the doc is trying to distinguish between a cpio file [where
> > TRAILER!!! marks the end] and an "initramfs buffer" which can contain
> > multiple concatenated cpio files [and hence more than one TRAILER!!!].
> > so things processing initramfs buffers need to be cleverer than cpio
> > when it comes to TRAILER!!!, but cpio doesn't. [and in practice,
> > isn't.])
>
> I agree gnu cpio doesn't, but that's because gnu.
>
> > i think that answers your question, but perhaps in excessive detail, so
> > i'll re-quote you and try again:
>
> I don't mind excessive detail when I'm trying to figure out the correct
> course of action for a design issue.
>
> >> confirm that the behavior you want is
> >> still to stop at TRAILER instead of flushing hardlink context but
> >> otherwise continuing to extract
> >
> > i agree that based on the Linux docs it would be more sensible to flush
> > but continue, but that's demonstrably not what GNU cpio does, so it
> > doesn't seem particularly helpful for us to do it. callers already have
> > to have the bash while loop nonsense,
>
> Some callers do, and I agree we can't _break_ them. But rendering the loop
> a NOP doesn't break it.
>
> > and implementing the better behavior in toybox would still
> > be "broken" from that perspective because they'd loop forever --- toybox
> > would at least have to consider the empty input as an error,
>
> Empty input should be an error, yes. That's consistent with tar:
>
>   $ toybox cpio -i < /dev/null
>   $ echo $?
>   0
>   $ toybox tar x < /dev/null
>   tar: Not tar
>   $ echo $?
>   1
>
> Whatever else we decide to do here, making empty input be an error sounds
> correct to me. Of course the gnu/dammit version goes:
>
>   $ cpio -i < /dev/null
>   Found end of tape.  To continue, type device/file name when ready.
>
> Which... no? Just no.
>
> > at which point we haven't
> > really reduced the ugliness much? (i'm also scared to suggest anything
> > beyond "do what GNU does" because i don't personally know anything about
> > cpio, and have never used it except to generate minimal repro cases for
> > stuff that kernel folks bring up.
>
> I have, sadly, had to learn rather a lot about it although I don't claim
> to be an expert.
>
> Still, if we're changing the behavior, eating all the input seems more
> correct, and erroring on empty input seems like it would satisfy the loops
> people are using to work around the limitations in the gnu/dammit
> implementation (a limitation which is already not present in the kernel's
> implementation used by initramfs).
>
> > i haven't looked at BSD, but they seem to interpret TRAILER!!! as end
> > of archive too: https://www.freebsd.org/cgi/man.cgi?query=cpio&sektion=5
> > ... and
>
> Have they changed the behavior of their tool in the past 25 years? (It's
> not like 64 bit processors or large file support means much when your
> file format is 8 hex digits for all the metadata fields...)
>

i had assumed that macOS used GNU cpio, but i just checked and it's
actually the BSD cpio (3.3.2). so i can report that they behave differently
again: they do stop at TRAILER!!!, but putting them in the loop doesn't seem
to help, and they also don't error, so you end up in an infinite loop. so
superficially similar to toybox's current behavior, except without any
error message?


> > eighthly, carrying on past TRAILER!!! when no-one else does sounds like
> > one of those security issues Android had back in the "zip master key"
> > days; even if the
>
> I didn't hear this story, but it sounds unpleasant.
>
> > format is stupid, it's safer when everyone interprets the format the
> > same way... who knows what crap people are accidentally/deliberately
> > ignoring past a TRAILER!!! that isn't actually at the end [because they
> > _don't_ have the bash while loop]? i'd prefer not to find out :-) )
>
> Ok, valid point. But if they feed such an initramfs into the kernel it
> will process all those records now, so the behavior _isn't_ currently
> consistent.
>

/me checks kernel source to confirm that init/initramfs.c actually does
what buffer-format.rst claims. seems to be true.


> Who are the users of this you're seeing? (And the other major user of this
> (that I'm aware of) is RPM package format. I don't know what they do,
> because I don't know where the source to the rpm tools lives. I lost track
> circa https://lwn.net/Articles/196523/ and moved to .deb based systems
> anyway...)
>
> Hmmm. If you're really concerned about more capable default behavior being
> nebulously unsafe in a way that I can't prove a negative (grumble
> grumble), maybe it needs an --all option? The man page doesn't mention -A
> but of course:
>
>   $ cpio -iA < /dev/null
>   cpio: --append is meaningless with --extract
>   $ cpio -ia < /dev/null
>   cpio: --reset is meaningless with --extract
>
> This is gnu we're talking about: they only actually document stuff in
> "info" pages. Sigh. (The man page mentions --append but has no short
> option for it. The only reset it mentions is --reset-access-time and it
> doesn't say what that DOES...)
>
> Needing a for loop around the tool seems broken to me. Not breaking
> people's workarounds is important, but implementing behavior that WON'T
> while rendering the workaround unnecessary seems easy enough?
>
> People depending on a limitation of the tool for "security" is hard for me
> to say anything coherent about. I _want_ it to be the default behavior,
> but if it needs to be an option...
>

no, the fact that the kernel does interpret these files the other way, i
think, actually flips this argument around... it would make a lot more
sense if cpio agreed.


> > hmm. my second attempt seems to have more words than my first. i'll stop
> > here.
>
> Another reason the for loop creeps me out is programs read more data than
> they actually need from input ALL THE TIME. It's how ansi FILE * buffers
> work, and an input pipe isn't seekable so you can't put the data _back_ if
> you find yourself with extra and are about to exit.
>
> This is an implicit dependence on an implementation detail, that you can
> continue from where the previous program left off reading the same pipe
> without having lost anything to buffers reading ahead. (Yes, I wrote my
> cpio with fd rather than FILE for that reason, but DEPENDING on it? Ew.)
>
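
The read-ahead hazard described in the quoted text can be reproduced in a
few lines. This is only an illustrative sketch in Python rather than C:
split_pipe is a hypothetical helper name, and Python's default buffered
reader stands in for an ANSI FILE * buffer. The first reader asks for a
few bytes from a pipe, and we then look at what a second reader of the
same file descriptor would still see.

```python
import os


def split_pipe(payload, wanted, buffered):
    """Read `wanted` bytes from a pipe, then return (head, rest), where
    rest is what a later reader of the same descriptor would still see.
    Hypothetical helper for illustration only."""
    r, w = os.pipe()
    try:
        os.write(w, payload)  # payload is assumed to fit in the pipe buffer
    finally:
        os.close(w)           # close the write end so reads can hit EOF
    try:
        if buffered:
            # Buffered reader: it may pull far more than `wanted` from the
            # descriptor, and the surplus is lost when the reader goes away.
            with open(r, "rb", closefd=False) as f:
                head = f.read(wanted)
        else:
            # Raw descriptor read: consumes at most `wanted` bytes.
            head = os.read(r, wanted)
        rest = os.read(r, len(payload))
    finally:
        os.close(r)
    return head, rest
```

With the raw read, head plus rest reassembles the payload. With the
buffered read, some or all of the remaining bytes have already been
swallowed by the first reader, which is why a loop of separate cpio
processes on one pipe only works if each of them reads unbuffered.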

afaict this is the stackoverflow workaround for the GNU cpio behavior.

but, yeah, you've persuaded me that "behave like initramfs.c" is the way to
go. i've asked the original submitter and they got back really quickly
saying basically "that sounds great; i didn't ask for that because i
assumed you'd want to behave the same as GNU".

do you already have the "do the right thing" patch ready, or should i send
that today?
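
For concreteness, here is a rough sketch of the "behave like initramfs.c"
semantics discussed above: keep reading records past TRAILER!!!, reset the
hard link table when a trailer is seen, skip NUL padding between archives,
and treat input with no cpio records at all as an error. It is written in
Python for brevity and is not the toybox or kernel code. The names
read_members and pack_member are hypothetical, only the "newc" header
layout (6 byte magic plus 13 fields of 8 hex digits) is handled, and the
hard link tracking is simplified to regular files.

```python
import io
import stat

HEADER_LEN = 110  # "070701" magic plus 13 fields of 8 hex digits each


def pad4(n):
    """Number of NUL bytes needed to round n up to a multiple of 4."""
    return -n % 4


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated cpio archive")
    return data


def read_members(stream):
    """Yield (name, data, link_target) across concatenated newc archives."""
    hardlinks = {}  # (major, minor, ino) -> first name seen with that tuple
    records = 0
    while True:
        first = stream.read(1)
        if not first:
            break     # real end of input
        if first == b"\0":
            continue  # NUL padding between archives
        header = first + read_exact(stream, HEADER_LEN - 1)
        if header[:6] not in (b"070701", b"070702"):
            raise ValueError("bad cpio magic")
        fields = [int(header[i:i + 8], 16) for i in range(6, HEADER_LEN, 8)]
        ino, mode, nlink, size = fields[0], fields[1], fields[4], fields[6]
        major, minor, namesize = fields[7], fields[8], fields[11]
        name = read_exact(stream, namesize).rstrip(b"\0").decode()
        read_exact(stream, pad4(HEADER_LEN + namesize))
        data = read_exact(stream, size)
        read_exact(stream, pad4(size))
        records += 1
        if name == "TRAILER!!!":
            hardlinks.clear()  # flush hard link state, but keep going
            continue
        link_target = None
        if nlink >= 2 and stat.S_ISREG(mode):
            key = (major, minor, ino)
            link_target = hardlinks.get(key)
            hardlinks.setdefault(key, name)
        yield name, data, link_target
    if records == 0:
        raise ValueError("not cpio")  # empty input is an error


def pack_member(name, data=b"", ino=0, nlink=1, mode=0o100644):
    """Build one newc record, for trying the reader out (illustrative)."""
    raw = name.encode() + b"\0"
    fields = [ino, mode, 0, 0, nlink, 0, len(data), 0, 0, 0, 0, len(raw), 0]
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + raw
    out += b"\0" * pad4(len(out))
    return out + data + b"\0" * pad4(len(data))


if __name__ == "__main__":
    one = pack_member("a", b"hello") + pack_member("TRAILER!!!")
    two = pack_member("b", b"world") + pack_member("TRAILER!!!")
    for name, data, _ in read_members(io.BytesIO(one + two)):
        print(name, data)
```

Because the link table is cleared at each trailer, two independently
generated archives that happen to reuse the same inode number are not
mistaken for hard links to each other, which is the point of the "tuple
buffer is reset" wording in the kernel doc quoted earlier.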


> > (i noticed as well that everyone seems to actually deal in _compressed_
> > cpio files, so in an ideal world i suspect cpio should be as intelligent
> > as tar when it comes to such things --- but i think cpio'ing is too
> > niche to warrant doing anything better than GNU.)
>
> The linux kernel already does better than gnu. That's why they wrote their
> own cpio create and extract plumbing. Create is in:
>
> https://github.com/torvalds/linux/blob/master/usr/gen_init_cpio.c
> https://github.com/torvalds/linux/blob/master/usr/gen_initramfs.sh
>
> And extract is:
>
> https://github.com/torvalds/linux/blob/master/init/initramfs.c#L256
>
> Dunno what rpm is doing behind the scenes, but the kernel guys have talked
> about xattr support (and sparse files, and 64 bit timestamps, and...) on
> more than one occasion. Hence my todo list section on that, albeit in the
> probably post-1.0 "teach patch.c about the git file rename syntax" sort of
> way...
>
> Linux cpio outgrowing gnu is probably inevitable. Richard Stallman is not
> steering anything, hasn't been for decades. (He's sitting in a big chair
> making vroom-vroom noises with his mouth, but the wheel and pedals aren't
> connected to anything.)
>
> Rob
>
> P.S. sparse files are also a potential way to handle files > 4 gigs, by
> breaking them into segments, but this is initramfs we're talking about so
> people generally make pained noises when it comes up.
>
> P.P.S. RPM also addressed large file support, but as usual is profoundly
> unhelpful in saying exactly HOW ala
> https://rpm.org/devel_doc/large_files.html
> because their business model is to obfuscate stuff until you pay them
> thousands of dollars to be experts and not ask questions. I'm under the
> impression they named this business model "enterprise" after the way the
> holodeck keeps malfunctioning and trying to kill people. See also systemd.
>