[Toybox] [PATCH] cpio: support reading concatenated cpio files.

Sun Apr 18 00:54:55 PDT 2021

On Sun, Apr 18, 2021 at 4:59 AM Rob Landley <rob at landley.net> wrote:

> On 4/17/21 8:40 AM, Yi-yo Chiang wrote:
> > On Sat, Apr 17, 2021 at 7:32 PM Rob Landley <rob at landley.net
> > <mailto:rob at landley.net>> wrote:
> >
> >     On 4/17/21 4:43 AM, Yi-yo Chiang wrote:
> >     > On Sat, Apr 17, 2021 at 2:56 PM Rob Landley <rob at landley.net
> >     <mailto:rob at landley.net>
> >     > <mailto:rob at landley.net <mailto:rob at landley.net>>> wrote:
> >     >
> >     >     On 4/16/21 1:44 PM, Yi-yo Chiang wrote:
> >     >     > I'm not sure what Elliot's goal is? I assume he's trying to
> extract a
> >     >     > concatenated ramdisk, and I still see a problem in the
> current
> >     solution.
> >     >     >
> >     >     > The buffer-format
> >     >     >
> >     >
> >      (
> https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt
> )
> >     >     says:
> >     >     >
> >     >     >   initramfs  := ("\0" | cpio_archive | cpio_gzip_archive)*
> >     >     >
> >     >     > In other words, both `cat a.cpio b.cpio >merged.cpio` and
> `(cat
> >     a.cpio && echo
> >     >     > -n -e '\0\0\0' && cat b.cpio) >merged.cpio` are valid
> initramfs.
> >     >
> >     >     It also implies that two compressed files can be concatenated
> and
> >     separated by
> >     >     arbitrary runs of nulls, or you can have a compressed file and a
> >     non-compressed
> >     >     file concatenated, or...
> >     >
> >     >
> >     > Correct. Upon further inspection, it's actually "arbitrary NULLs
> could
> >     prepend a
> >     > GZIP(cpio_archive)",
> >
> >     I'm not currently handling that case, and I'm not sure where is the
> right place
> >     to handle it? (Should gzip handle it, or should cpio call out to
> gzip?)
> >
> >     And then you have to care that the _compressor_ stops gracefully at
> the end of
> >     its compressed data and isn't reading/discarding extra from its input...
> >
> >
> > I just read more into the kernel initramfs.c and decompressor_*.c, and
> seems
> > like even the kernel doesn't handle this all that well.
> > For example, the gzip decompressor (inflate) stops gracefully at the end
> of
> > compressed data, but lz4 decompressor doesn't and errors when there is
> data past
> > the end of compressed data.
>
> It's possible to make this work right, but not _easy_ to do so, because of
> the
> read buffer issues.
>
> > Back to the original question, I think handling concatenated
> uncompressed cpio
> > is good enough.
>
> In theory that's in now.
>
> >     I can teach my cpio to call out to decompressors, but this is new
> design that
> >     needs to be thought through. Does it automatically do it, is there a
> new flag?
> >     Is this decompression side only and the compression side still needs
> its output
> >     piped?
> >
> > I highly doubt zcat supports initramfs-style concatenated .gz.
>
> I wrote my own lib/deflate.c from scratch (keep meaning to finish the
> compressor
> side but my todo list runneth over and toybox is not my day job), so I'm
> pretty
> sure I can make it handle multiple concatenated files.
>
> And I vaguely recall that zlib's version of handling non-compressed data
> was to
> send it through to the output verbatim. (Which means if you have a tarball
> containing a gzip file things could get ugly.)
>
> > AFAICT, in order
> > to deal with "(cat a.cpio.gz && echo -n -e '\0\0' && cat
> > b.cpio.gz)>initramfs.img", right now we need to use tools such as
> binwalk && dd
> > to slice the initramfs.img into its individual components, and then pipe
> the
> > sliced chunks into zcat, lz4cat ... whatever-cat. It sure sounds useful
> for cpio
> > to have an option or flag (like tar) to let it auto detect the
> compression
> > method and call the compression library.
>
> Toybox tar doesn't use compression libraries for this, it forks another
> process
> and feeds data through a pipe. (Which gets us SMP automatically, and means
> we
> can use compression types we don't internally implement.) That said, I
> could
> easily add a --showsize option that prints the number of bytes of input
> consumed
> to the three compressors toybox implements. (Not that this is useful
> because I
> don't _require_ using the toybox compressors/decompressors, for
> interoperability
> reasons, so can't depend on a feature I'd add.)
>
> The problem isn't figuring out where the data _starts_, it's figuring out
> where
> it _ends_. Long ago I had a design for a parallel bzip2 decompressor that
> would
> search ahead in the code for the next bzip2 start of block signature and
> dispatch each chunk to a thread pool (and then only keep the results when
> the
> previous block said "we ended here" and that was one of the starting
> points;
> ones that got bypassed because false positive in the middle of a block
> would
> just have their output discarded).
>
> But at this point bzip2 is obsolete enough I'm unlikely to bother even
> when I
> make it that far down on my todo list. That said, it shouldn't be too hard
> to do
> something similar for gzip? (The start of block has to be byte aligned.)
> And
> _specifically_ this would be "figure out where the next compression start
> is,
> keep the last X kilobytes of data we fed to the decompressor pipe, and
> start the
> next one at the last compressor start signature match near the ending point
> where the decompressor gave up.
>
> Which still doesn't help you with _decompressed_ data between compressed
> runs...
>
> What might be useful is to special case gzip, and handle that with the
> internal
> deflate code, which can accurately measure the number of bytes consumed and
> preserve any data after that. And then just say that's the ONLY
> compression type
> that concatenation works for.
>
> I could also just teach gzip that you can concatenate .gz files,

gzip already works this way. AFAIK it is an explicit feature of the format:
a file made of concatenated .gz members is itself a valid gzip file.

This:
  gzip -c a >a.gz
  gzip -c b >b.gz
  cat a.gz b.gz > ab.gz
  zcat ab.gz >ab
and this:
  cat a b >ab
have the same effect.
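
The same check as a minimal Python sketch, assuming only the standard
gzip module. The payloads are made up for illustration:

```python
import gzip

a = b"contents of a\n"
b = b"contents of b\n"

# Compress each input separately, then join the two .gz members
# byte for byte, as `cat a.gz b.gz > ab.gz` would.
ab_gz = gzip.compress(a) + gzip.compress(b)

# gzip.decompress() reads every member in the stream, like zcat.
ab = gzip.decompress(ab_gz)

# The result matches plain concatenation of the inputs (`cat a b`).
assert ab == a + b
```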

Thus supporting concatenated .cpio.gz is as simple as:
  zcat a.cpio.gz b.cpio.gz | cpio -i
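
For the cpio half of that pipeline, the reader has to keep going after
TRAILER!!! instead of stopping there. A rough sketch of such a loop for
the newc format, in Python rather than toybox's C. newc_entry() and
list_names() are hypothetical helpers, not toybox code, and computing
the 4-byte alignment relative to the start of each archive is an
assumption of this sketch (real unpackers may be stricter):

```python
def newc_entry(name, data=b""):
    """Build one newc record (hypothetical helper, demo only)."""
    name_z = name.encode() + b"\0"
    # ino, mode, uid, gid, nlink, mtime, filesize, devmajor, devminor,
    # rdevmajor, rdevminor, namesize, check
    fields = [0, 0o100644, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_z), 0]
    record = b"070701" + b"".join(b"%08X" % f for f in fields) + name_z
    record += b"\0" * (-len(record) % 4)   # pad header+name to 4 bytes
    record += data
    record += b"\0" * (-len(record) % 4)   # pad file data to 4 bytes
    return record


def list_names(buf):
    """Return member names from one or more concatenated newc archives."""
    names = []
    pos = 0
    while pos < len(buf):
        if buf[pos] == 0:                  # NUL padding between archives
            pos += 1
            continue
        base = pos                         # start of the next archive
        while True:
            if buf[pos:pos + 6] != b"070701":
                raise ValueError("bad magic at offset %d" % pos)
            filesize = int(buf[pos + 54:pos + 62], 16)
            namesize = int(buf[pos + 94:pos + 102], 16)
            name = buf[pos + 110:pos + 110 + namesize - 1].decode()
            pos += 110 + namesize
            pos = base + ((pos - base + 3) & ~3)   # skip name padding
            pos += filesize
            pos = base + ((pos - base + 3) & ~3)   # skip data padding
            if name == "TRAILER!!!":
                break                      # end of this archive, not of input
            names.append(name)
    return names


a = newc_entry("a.txt", b"hello") + newc_entry("TRAILER!!!")
b = newc_entry("dir/b.txt", b"world!!") + newc_entry("TRAILER!!!")
print(list_names(a + b"\0" * 4 + b))
```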

> and say we support concatenating file.cpio.gz or file.cpio but don't mix them.
>

Concatenating .cpio.gz or .cpio without mixing is already supported (with
the latest "go past TRAILER!!!" change, cheers!). The problematic case is
when NUL padding is thrown into the mix. I think calling the native inflate
code to determine the start and end offsets of each compressed region sounds
good, since DEFLATE seems to encode an explicit end marker, so the end
offset is well-defined. LZ4, on the other hand, does not: the legacy lz4
format used by the kernel relies on EOF to mark the end of the compressed
stream, which makes it difficult to determine the actual boundary when
compressed and uncompressed data are concatenated. I don't know about the
other formats.
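
To illustrate the "let inflate tell us where the member ends" idea, here
is a minimal sketch using Python's zlib binding rather than toybox's
lib/deflate.c. split_segments() is a hypothetical helper. It handles
only NUL padding and gzip members, and as a simplifying assumption it
treats anything else as one raw segment running to the end of the
buffer (an uncompressed cpio contains NUL bytes itself, so it cannot be
split on NULs):

```python
import gzip
import zlib


def split_segments(buf):
    """Yield (kind, start, end, payload) for each segment of buf."""
    pos = 0
    while pos < len(buf):
        if buf[pos] == 0:
            # Run of NUL padding between archives.
            end = pos
            while end < len(buf) and buf[end] == 0:
                end += 1
            yield ("pad", pos, end, b"")
        elif buf[pos:pos + 2] == b"\x1f\x8b":
            # wbits = MAX_WBITS | 16 selects gzip framing. The object
            # stops at the end of one member and leaves the rest of the
            # input in unused_data, which gives the exact end offset.
            d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
            payload = d.decompress(buf[pos:])
            if not d.eof:
                raise ValueError("truncated gzip member at %d" % pos)
            end = len(buf) - len(d.unused_data)
            yield ("gzip", pos, end, payload)
        else:
            # Assumption: everything else is one uncompressed segment.
            end = len(buf)
            yield ("raw", pos, end, buf[pos:end])
        pos = end


# Stand-ins for a.cpio.gz and b.cpio.gz separated by two NULs.
first = gzip.compress(b"first archive bytes")
second = gzip.compress(b"second archive bytes")
img = first + b"\0\0" + second
for kind, start, end, payload in split_segments(img):
    print(kind, start, end, len(payload))
```

Each gzip payload could then be handed to the cpio extractor in turn.
Nothing comparable is shown for lz4 because, as noted above, the legacy
format gives no such end marker to stop on.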

>
> Rob
>

-- 

Yi-yo Chiang
Software Engineer
yochiang at google.com