[Toybox] [PATCH] cpio: support reading concatenated cpio files.

Sat Apr 17 14:14:18 PDT 2021

On 4/17/21 8:40 AM, Yi-yo Chiang wrote:
> On Sat, Apr 17, 2021 at 7:32 PM Rob Landley <rob at landley.net> wrote:
> 
>     On 4/17/21 4:43 AM, Yi-yo Chiang wrote:
>     > On Sat, Apr 17, 2021 at 2:56 PM Rob Landley <rob at landley.net> wrote:
>     >
>     >     On 4/16/21 1:44 PM, Yi-yo Chiang wrote:
>     >     > I'm not sure what Elliot's goal is? I assume he's trying to extract a
>     >     > concatenated ramdisk, and I still see a problem in the current solution.
>     >     >
>     >     > The buffer-format
>     >     > (https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt)
>     >     > says:
>     >     >
>     >     >   initramfs  := ("\0" | cpio_archive | cpio_gzip_archive)*
>     >     >
>     >     > In other words, both `cat a.cpio b.cpio >merged.cpio` and
>     >     > `(cat a.cpio && echo -n -e '\0\0\0' && cat b.cpio) >merged.cpio` are
>     >     > valid initramfs.
>     >
>     >     It also implies that two compressed files can be concatenated and
>     >     separated by arbitrary runs of nulls, or you can have a compressed file
>     >     and a non-compressed file concatenated, or...
>     >
>     >
>     > Correct. Upon further inspection, it's actually "arbitrary NULLs could
>     > prepend a GZIP(cpio_archive)",
> 
>     I'm not currently handling that case, and I'm not sure where the right place
>     to handle it is. (Should gzip handle it, or should cpio call out to gzip?)
> 
>     And then you have to care that the _decompressor_ stops gracefully at the end of
>     its compressed data and isn't reading/discarding extra from its input...
> 
> 
> I just read more into the kernel initramfs.c and decompressor_*.c, and it seems
> like even the kernel doesn't handle this all that well.
> For example, the gzip decompressor (inflate) stops gracefully at the end of
> compressed data, but lz4 decompressor doesn't and errors when there is data past
> the end of compressed data.

It's possible to make this work right, but not _easy_ to do so, because of the
read buffer issues.

> Back to the original question, I think handling concatenated uncompressed cpio
> is good enough.

In theory that's in now.
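
For illustration only, here is a minimal Python sketch of what "reading concatenated
uncompressed cpio" means. It is not the toybox code. It assumes the "newc" format
(magic 070701), assumes padding is computed relative to each header's start, and
the names make_newc() and list_concatenated() are made up for this example.

```python
MAGIC = b"070701"
HEADER_LEN = 110  # 6-byte magic plus 13 fields of 8 hex digits each


def pad4(n):
    """Bytes of padding needed to round n up to a multiple of 4."""
    return (4 - n % 4) % 4


def make_newc(entries):
    """Build a tiny newc archive from (name, data) pairs, ending in TRAILER!!!."""
    out = bytearray()
    for name, data in list(entries) + [("TRAILER!!!", b"")]:
        name_bytes = name.encode() + b"\0"
        # Field order: ino, mode, uid, gid, nlink, mtime, filesize,
        # devmajor, devminor, rdevmajor, rdevminor, namesize, check.
        fields = [0, 0o100644, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_bytes), 0]
        out += MAGIC + b"".join(b"%08X" % f for f in fields)
        out += name_bytes + b"\0" * pad4(HEADER_LEN + len(name_bytes))
        out += data + b"\0" * pad4(len(data))
    return bytes(out)


def list_concatenated(blob):
    """Return member names from back-to-back archives, skipping NUL runs."""
    names = []
    pos = 0
    while pos < len(blob):
        if blob[pos] == 0:  # NUL filler before or between archives
            pos += 1
            continue
        header = blob[pos:pos + HEADER_LEN]
        if len(header) < HEADER_LEN or header[:6] != MAGIC:
            raise ValueError("bad cpio header at offset %d" % pos)
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + HEADER_LEN
        name = blob[name_start:name_start + namesize - 1].decode()
        data_start = name_start + namesize + pad4(HEADER_LEN + namesize)
        pos = data_start + filesize + pad4(filesize)
        if name != "TRAILER!!!":  # a trailer ends one archive, not the input
            names.append(name)
    return names
```

The only change from a single-archive reader is that TRAILER!!! no longer stops
the loop, and NUL bytes where a header should start are skipped rather than
treated as an error.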

>     I can teach my cpio to call out to decompressors, but this is new design that
>     needs to be thought through. Does it automatically do it, is there a new flag?
>     Is this decompression side only and the compression side still needs its output
>     piped?
> 
> I highly doubt zcat supports initramfs-style concatenated .gz.

I wrote my own lib/deflate.c from scratch (keep meaning to finish the compressor
side but my todo list runneth over and toybox is not my day job), so I'm pretty
sure I can make it handle multiple concatenated files.

And I vaguely recall that zlib's version of handling non-compressed data was to
send it through to the output verbatim. (Which means if you have a tarball
containing a gzip file things could get ugly.)

> AFAICT, in order
> to deal with "(cat a.cpio.gz && echo -n -e '\0\0' && cat
> b.cpio.gz)>initramfs.img", right now we need to use tools such as binwalk && dd
> to slice the initramfs.img into its individual components, and then pipe the
> sliced chunks into zcat, lz4cat ... whatever-cat. It sure sounds useful for cpio
> to have an option or flag (like tar) to let it auto detect the compression
> method and call the compression library.

Toybox tar doesn't use compression libraries for this, it forks another process
and feeds data through a pipe. (Which gets us SMP automatically, and means we
can use compression types we don't internally implement.) That said, I could
easily add a --showsize option that prints the number of bytes of input consumed
to the three compressors toybox implements. (Not that this is useful because I
don't _require_ using the toybox compressors/decompressors, for interoperability
reasons, so can't depend on a feature I'd add.)

The problem isn't figuring out where the data _starts_, it's figuring out where
it _ends_. Long ago I had a design for a parallel bzip2 decompressor that would
search ahead in the input for the next bzip2 start of block signature and
dispatch each chunk to a thread pool (and then only keep the results when the
previous block said "we ended here" and that was one of the starting points;
ones that got bypassed because of a false positive in the middle of a block
would just have their output discarded).

But at this point bzip2 is obsolete enough I'm unlikely to bother even when I
make it that far down on my todo list. That said, it shouldn't be too hard to do
something similar for gzip? (The start of block has to be byte aligned.) And
_specifically_ this would be "figure out where the next compression start is,
keep the last X kilobytes of data we fed to the decompressor pipe, and start the
next one at the last compressor start signature match near the ending point
where the decompressor gave up".

Which still doesn't help you with _uncompressed_ data between compressed runs...
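
A rough Python sketch of the signature-scan idea, as an illustration rather than a
design. It swaps the external decompressor pipe for in-process zlib so the example
is self-contained, and it treats the three bytes 1f 8b 08 (gzip ID1, ID2, and the
deflate method byte) as the start signature. The function names are hypothetical.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM=deflate


def candidate_starts(buf, start=0):
    """Yield every offset at or after `start` where the signature appears."""
    pos = buf.find(GZIP_MAGIC, start)
    while pos != -1:
        yield pos
        pos = buf.find(GZIP_MAGIC, pos + 1)


def next_member(buf, start=0):
    """Return (offset, payload, end) for the first candidate that decodes fully."""
    for off in candidate_starts(buf, start):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ selects gzip framing
        try:
            payload = d.decompress(buf[off:])
        except zlib.error:
            continue  # false positive: signature bytes inside other data
        if d.eof:  # only accept a member whose trailer was reached
            end = len(buf) - len(d.unused_data)
            return off, payload, end
    return None
```

Candidates that fail to decode are simply discarded, which is the same "throw away
the false positives" approach as the bzip2 design above.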

What might be useful is to special case gzip, and handle that with the internal
deflate code, which can accurately measure the number of bytes consumed and
preserve any data after that. And then just say that's the ONLY compression type
that concatenation works for.
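
To make "measure the number of bytes consumed and preserve any data after that"
concrete, here is a small sketch using Python's zlib module as a stand-in for
toybox's lib/deflate.c (which is C and has its own interface). The helper name
gunzip_one() is made up for this example.

```python
import gzip
import zlib


def gunzip_one(blob):
    """Decode one gzip member; return (payload, bytes_consumed, leftover)."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ selects gzip framing
    payload = d.decompress(blob)
    if not d.eof:
        raise ValueError("truncated gzip member")
    leftover = d.unused_data  # everything after the member's trailer
    return payload, len(blob) - len(leftover), leftover
```

Because the decompressor runs in-process, nothing past the end of the member is
lost to a pipe's read-ahead; the caller gets the leftover bytes back and can hand
them to whatever parses the next component.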

I could also just teach gzip that you can concatenate .gz files, and say we
support concatenating file.cpio.gz or file.cpio but don't mix them.
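
As a sketch of what "concatenate .gz files" could look like, again with Python's
zlib standing in for the toybox deflate code: decode a member, then restart on
whatever input is left. Skipping NUL bytes between members is an assumption taken
from the initramfs buffer-format discussion above, and gunzip_all() is a
hypothetical name.

```python
import gzip
import zlib


def gunzip_all(blob):
    """Decode back-to-back gzip members; return the payloads in order."""
    payloads = []
    rest = blob.lstrip(b"\0")  # tolerate NUL filler before the first member
    while rest:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ selects gzip framing
        payloads.append(d.decompress(rest))
        if not d.eof:
            raise ValueError("truncated gzip member")
        rest = d.unused_data.lstrip(b"\0")  # filler between members
    return payloads
```

Anything that is neither NUL nor a gzip header makes zlib raise an error here,
which matches the "don't mix them" restriction.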

Rob