[Toybox] [PATCH] cpio: support reading concatenated cpio files.

Thu Apr 15 16:57:36 PDT 2021

On 4/15/21 12:06 PM, enh wrote:
>     > i haven't looked at BSD, but they seem to interpret TRAILER!!! as end
>     > of archive too: https://www.freebsd.org/cgi/man.cgi?query=cpio&sektion=5
>     ... and
> 
>     Have they changed the behavior of their tool in the past 25 years? (It's not
>     like 64 bit processors or large file support means much when your file format is
>     8 hex digits for all the metadata fields...)
> 
> 
> i had assumed that macOS used GNU cpio, but i just checked and it's actually the
> BSD cpio (3.3.2). so i can report that they behave differently again: they do
> stop at TRAILER!!! but putting them in the loop doesn't seem to help, but they
> also don't error so you end up in the infinite loop. so superficially similar to
> toybox's currently behavior, except without any error message?

Wheee. Fixable with a toybox prebuilt, of course, once we figure out the
behavior we want.

>     Who are the users of this you're seeing? (And the other major user of this (that
>     I'm aware of) is RPM package format. I don't know what do they do, because I
>     don't know where the source to the rpm tools lives. I lost track circa
>     https://lwn.net/Articles/196523/ and moved to .deb based systems anyway...)
> 
>     Hmmm. If you're really concerned about more capable default behavior being
>     nebulously unsafe in a way that I can't prove a negative (grumble grumble),
>     maybe it needs an --all option? The man page doesn't mention -A but of course:
> 
>       $ cpio -iA < /dev/null
>       cpio: --append is meaningless with --extract
>       $ cpio -ia < /dev/null
>       cpio: --reset is meaningless with --extract
> 
>     This is gnu we're talking about: they only actually document stuff in "info"
>     pages. Sigh. (The man page mentions --append but has no short option for it.
>     The only reset it mentions is --reset-access-time and it doesn't say what that
>     DOES...)
> 
>     Needing a for loop around the tool seems broken to me. Not breaking people's
>     workarounds is important, but implementing behavior that WON'T while rendering
>     the workaround unnecessary seems easy enough?
> 
>     People depending on a limitation of the tool for "security" is hard for me to
>     say anything coherent about. I _want_ it to be the default behavior, but if it
>     needs to be an option...
> 
> 
> no, the fact that the kernel does interpret these files the other i think
> actually flips this argument around... it would make a lot more sense if cpio
> agreed.

Oh good. I'll try to implement that tonight.

So cpio should eat all records but flush hardlink status when it hits trailer,
and then exit with error on an empty input the same way tar currently does...

Darn it, cpio.tests does not currently have any hardlink tests. Lemme dig and
try to figure out WHAT state exactly TRAILER!!! is flushing. I think it's just
the link counts matching up, except you'd just add the first one to the list and
decrement as you see each one so all TRAILER would do is trigger error checking
to make sure there weren't extras? Or...? I mean, I can clear the linked list of
remembered hardlink info...

Ah, right. It's the _inode_ info. In case you're concatenating cpio snippets
produced from two different filesystems (created on two different machines,
otherwise the major/minor probably wouldn't match unless we're talking removable
media), you don't want inode collisions and potentially hardlink to the wrong thing.

Got it, I can do that. Except you STILL want to remove hardlink entries when
their link count decrements to zero, so the flushing presumably only means
anything if you had an incomplete cpio snippet...

Ah: if a file has three hardlinks and cpio tars up ONE of them, it still gets
added to the list (because link count) but then never gets cleared because you
don't see the other two. Hence TRAILER to flush to avoid collisions. Ok. I need
to add a test for this with a forced collision and then confirming the two are
NOT hardlinked because TRAILER flush happened. Right...

>     > hmm. my second attempt seems to have more words than my first. i'll stop here.
> 
>     Another reason the for loop creeps me out is programs read more data than they
>     actually need from input ALL THE TIME. It's how ansi FILE * buffers work, and an
>     input pipe isn't seekable so you can't put the data _back_ if you find yourself
>     with extra and are about to exit.
> 
>     This is an implicit dependence on an implementation detail, that you can
>     continue from where the previous program left off reading the same pipe without
>     having lost anything to buffers reading ahead. (Yes, I wrote my cpio with fd
>     rather than FILE for that reason, but DEPENDING on it? Ew.)
> 
> 
> afaict this is the stackoverflow workaround for the GNU cpio behavior.
> 
> but, yeah, you've persuaded me that "behave like initramfs.c" is the way to go.
> i've asked the original submitter and they got back really quickly saying
> basically "that sounds great; i didn't ask for that because i assumed you'd want
> to behave the same as GNU".

I want to do the right thing, which involves behaving _better_ in places. (After
careful scrutiny.)

Clay Shirky wrote a lovely piece a few years back with the line "the waterfall
method amounts to a pledge by all parties not to learn anything while doing the
actual work":

https://web.archive.org/web/20160202024237/http://www.shirky.com/weblog/2013/11/healthcare-gov-and-the-gulf-between-planning-and-reality/

A plan is a frame of reference to diverge from.

> do you already have the "do the right thing" patch ready, or should i send that
> today?

I'm good, I'll try to get it in tonight. The hard part is the test suite, not
the implementation.

Thanks,

Rob