[Toybox] [PATCH] cpio: support reading concatenated cpio files.

Sat Apr 17 06:40:58 PDT 2021

On Sat, Apr 17, 2021 at 7:32 PM Rob Landley <rob at landley.net> wrote:

> On 4/17/21 4:43 AM, Yi-yo Chiang wrote:
> > On Sat, Apr 17, 2021 at 2:56 PM Rob Landley <rob at landley.net> wrote:
> >
> >     On 4/16/21 1:44 PM, Yi-yo Chiang wrote:
> >     > I'm not sure what Elliot's goal is? I assume he's trying to extract a
> >     > concatenated ramdisk, and I still see a problem in the current solution.
> >     >
> >     > The buffer-format
> >     > (https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt)
> >     > says:
> >     >
> >     >   initramfs  := ("\0" | cpio_archive | cpio_gzip_archive)*
> >     >
> >     > In other words, both `cat a.cpio b.cpio >merged.cpio` and
> >     > `(cat a.cpio && echo -n -e '\0\0\0' && cat b.cpio) >merged.cpio` are
> >     > valid initramfs.
> >
> >     It also implies that two compressed files can be concatenated and separated by
> >     arbitrary runs of nulls, or you can have a compressed file and a non-compressed
> >     file concatenated, or...
> >
> >
> > Correct. Upon further inspection, it's actually "arbitrary NULLs could
> prepend a
> > GZIP(cpio_archive)",
>
> I'm not currently handling that case, and I'm not sure where the right place
> to handle it is? (Should gzip handle it, or should cpio call out to gzip?)
>
> And then you have to care that the _compressor_ stops gracefully at the end of
> its compressed data and isn't reading/discarding extra from its input...
>
>
I just read further into the kernel's initramfs.c and decompress_*.c, and it
seems that even the kernel doesn't handle this all that well.
For example, the gzip decompressor (inflate) stops gracefully at the end of
the compressed data, but the lz4 decompressor doesn't: it errors out when
there is data past the end of the compressed data.
So even though "cat a.cpio.gz b.cpio.lz4 >c.ramdisk" and "cat a.cpio.lz4
b.cpio.gz >c.ramdisk" both follow the initramfs grammar, the kernel can
only boot the former. I even found a bug describing the same issue (
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840945), but it seems
to have dropped off everyone's radar.

Back to the original question, I think handling concatenated uncompressed
cpio is good enough. I can't comment much on concatenated archives with
mixed compression, as I'm unfamiliar with all those different compression
algorithms, but since even the kernel doesn't fully support this
configuration, I guess there aren't many use cases for it out there.
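
To make the "concatenated uncompressed cpio" case concrete, here is a minimal sketch of a scanner that walks a buffer holding several newc archives glued together, skipping any run of NUL bytes between entries. It is not the toybox code: count_cpio_entries() and its helpers are hypothetical names, only the "070701" magic is accepted, alignment is computed relative to each header (the lenient "skip ANY extra NULs" reading, looser than what initramfs.c enforces), and hardlink handling is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CPIO_HDR_LEN 110  /* "070701" magic plus 13 fields of 8 hex digits */

/* Parse one 8-digit hex header field; returns 0 on success, -1 on a bad digit. */
static int hex8(const unsigned char *p, unsigned long *out)
{
  unsigned long val = 0;
  int i;

  for (i = 0; i < 8; i++) {
    unsigned char c = p[i];
    unsigned digit;

    if (c >= '0' && c <= '9') digit = c - '0';
    else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
    else return -1;
    val = (val << 4) | digit;
  }
  *out = val;
  return 0;
}

/* Round n up to the next multiple of 4. */
static size_t align4(size_t n) { return (n + 3) & ~(size_t)3; }

/*
 * Hypothetical helper: walk concatenated newc archives in buf and return
 * the number of non-trailer entries, or -1 on malformed input.
 */
long count_cpio_entries(const unsigned char *buf, size_t len)
{
  size_t pos = 0;
  long count = 0;

  assert(buf != NULL || len == 0);
  for (;;) {
    unsigned long namesize, filesize;
    const unsigned char *name;
    size_t avail, data_off;

    /* Skip any run of NUL padding before the next header. */
    while (pos < len && !buf[pos]) pos++;
    if (pos == len) return count;

    avail = len - pos;
    if (avail < CPIO_HDR_LEN || memcmp(buf + pos, "070701", 6)) return -1;

    /* filesize is the 7th field (offset 54), namesize the 12th (offset 94). */
    if (hex8(buf + pos + 54, &filesize) || hex8(buf + pos + 94, &namesize))
      return -1;
    if (!namesize || namesize > avail - CPIO_HDR_LEN) return -1;
    name = buf + pos + CPIO_HDR_LEN;
    if (name[namesize - 1]) return -1;  /* name must be NUL terminated */

    /* Header plus name is padded to 4 bytes; the file data follows. */
    data_off = align4(CPIO_HDR_LEN + namesize);
    if (data_off > avail) data_off = avail;  /* tolerate missing final padding */
    if (filesize > avail - data_off) return -1;

    /* TRAILER!!! ends one archive; the loop simply carries on to the next. */
    if (strcmp((const char *)name, "TRAILER!!!")) count++;
    pos += data_off + filesize;
  }
}
```

The padding after each file's data is not computed explicitly; the NUL-skipping step at the top of the loop absorbs it together with any extra padding between archives.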

I think it's safe to say that for the majority of use cases, it's
sufficient to pipe the output of zcat or lz4cat into "cpio -i" to unpack
the initramfs. That is, the ramdisk is usually formed by concatenating
multiple compressed archives without padding: COMPRESS((cpio + alignment)
* N), but not (COMPRESS((cpio + alignment) * N) + alignment) * M.

> "arbitrary 4-aligned NULLS prepend a *uncompressed*
> > cpio_archive"
>
> This case it should be handling now.
>
> > and "cpio_file/cpio_trailer within a cpio_archive have to be
> > 4-aligned with arbitrary NULLs". initramfs.c seems to try very hard to
> respect
> > the alignment requirement, but I guess we could just skip *ANY* extra
> NULLs for
> > simplicity?
>
> It was already 4-aligned. That's part of the file specification. Padding
> with
> _more_ than that was throwing it off, though. Should handle it now?
>
> (Let me know what other tests I should add to tests/cpio.test.)
>
> >     Grrr. I need to test this. And possibly genericize the tar.c code to
> detect
> >     compression type and run it through a decompressor so cpio can do it
> too...
> >
> >
> > Sounds like another can of worms... :/
>
> Indeed. Haven't started that yet because tar.c is already doing it and I
> want to
> factor out common code from that, ala:
>
>         // detect gzip and bzip signatures
>         if (SWAP_BE16(*(short *)hdr)==0x1f8b) toys.optflags |= FLAG_z;
>         else if (!memcmp(hdr, "BZh", 3)) toys.optflags |= FLAG_j;
>         else if (peek_be(hdr, 7) == 0xfd377a585a0000UL) toys.optflags |= FLAG_J;
>         else error_exit("Not tar");
>
> > The buffer-format.txt seems to be a bit outdated, as Linux now supports
> a lot of
> > compression types besides gzip, and all of which are configurable
> > (https://elixir.bootlin.com/linux/latest/source/lib/decompress.c#L52).
> So the
> > initramfs grammar implemented by initramfs.c is in reality:
> >
> >   initramfs  := ("\0" | cpio_archive | compressed_cpio_archive)*
> >   compressed_cpio_archive := CONFIG_COMPRESSION_ALGORITHM(cpio_archive)
> >   CONFIG_COMPRESSION_ALGORITHM := GZIP | BZIP2 | LZMA | XZ | LZO | LZ4 |
> ZSTD
> >
> > where the exact set of compression algorithms are decided by the kernel
> config.
>
> Exactly. Toybox knows about gzip, bzip, and xz. (The only compressor I
> currently
> plan to natively support is gzip, but it has decompressors for the other
> two.
> The xz one is a bit stale and still in pending and needs serious cleanup,
> but
> was sourced from public domain code.)
>
> I can add more, but it hadn't previously come up?
>
> Also, I'm really fuzzy on the difference between xz/lzma/lzo/lz4/zstd.
>
> >     > btw gen_init_cpio.c also pads initramfs to 512-byte boundary
> >     > (https://github.com/torvalds/linux/blob/6fbd6cf85a3be127454a1ad58525a3adcf8612ab/usr/gen_init_cpio.c#L97)
> >
> >     *blink* *blink* Why...? (cpio doesn't have a 512 stride in the file
> format? It
> >     has a 4-byte stride for padding strings with NUL bytes, but that's
> about it?)
> >
> >     > If we're viewing buffer-format.txt as the "right" cpio spec, then
> I think we
> >     > should implement this too. We should skip arbitrary
> extra NUL-bytes padded
> >     > between cpio file frames
> >
> >     Skipping arbitrary extra null bytes at the start is easy enough to
> do. I guess
> >     the hardwired trailing read was expecting the 512 padding...
> >
> >     I'm gonna need add a _lot_ more test suite entries for this command.
> >
> >     Ok, skip arbitrary leading NUL bytes after each entry, pad last
> record to 512
> >     byte alignment with NUL bytes, autodetect compression type at each
> record start,
> >     implement hardlinks and have TRAILER!!! flush hardlink context...
> >
> >
> > I'm not so sure about padding the last entry to 512-byte boundary. 512
> looks
> > like a random value to me? (Or an implementation detail of GNU cpio and
> > gen_init_cpio). Nonetheless I think we should pad the last record to
> 4-byte
> > boundary, so that both
> >
> >   cat a.cpio.gz b.cpio.gz >c.cpio.gz
>
> It's been padding it to a 4 byte boundary all along, that's what those trailing
> 4 NULs on TRAILER!!! were for. (The first is the null terminator for the string,
> the other 3 are padding for alignment: 110+10+1+3=124 which is 31*4.)
>
>
Ah, you're right, this was always the case. My mistake, I misread the code.
So the only problematic case was extra padding between two cpio file
frames, which the latest code handles.
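
For reference, the 110+10+1+3=124 arithmetic can be written down as a small sketch. The function names here are made up for illustration and are not taken from toybox's cpio.c; the only facts assumed are the 110-byte newc header and the 4-byte alignment discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NEWC_HDR_LEN 110  /* fixed size of a newc header */

/* Bytes used by header + name + NUL terminator, before alignment padding. */
size_t newc_raw_len(const char *name)
{
  assert(name != NULL);
  return NEWC_HDR_LEN + strlen(name) + 1;
}

/* The same length rounded up to the next 4-byte boundary. */
size_t newc_padded_len(const char *name)
{
  return (newc_raw_len(name) + 3) & ~(size_t)3;
}

/* Number of NUL padding bytes that follow the name's own terminator. */
size_t newc_name_padding(const char *name)
{
  return newc_padded_len(name) - newc_raw_len(name);
}
```

For "TRAILER!!!" this gives a raw length of 121, 3 bytes of padding, and a padded length of 124, which matches the four trailing NULs (one terminator plus three of padding).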

> > and
> >
> >   zcat a.cpio.gz b.cpio.gz >c.cpio
> >
> > are valid initramfs/cpio?
>
> That's the headache part: should zcat understand that sort of
> concatenation? The
> gnu/dammit cpio implementation doesn't call out to compressors, you MUST
> do it
> in a pipeline. And even if zcat understood concatenation with NUL bytes,
> you can
> glue an .xz file to a .gz. Which tool hands off to which tool and when does
> control come BACK... (At least gz has little signatures at the start of
> blocks
> so runs of NUL bytes can be detected as invalid. Don't remember what bz2
> does
> off the top of my head, and never learned what xz does...)

> I can teach my cpio to call out to decompressors, but this is new design
> that
> needs to be thought through. Does it automatically do it, is there a new
> flag?
> Is this decompression side only and the compression side still needs its
> output
> piped?

I highly doubt zcat supports initramfs-style concatenated .gz. AFAICT, in
order to deal with "(cat a.cpio.gz && echo -n -e '\0\0' && cat
b.cpio.gz)>initramfs.img", right now we need to use tools such as binwalk
and dd to slice initramfs.img into its individual components, and then
pipe the sliced chunks into zcat, lz4cat ... whatever-cat. It sure sounds
useful for cpio to have an option or flag (like tar) to let it auto detect
the compression method and call the compression library.
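
A rough sketch of what such auto-detection could look like: skip leading NULs, then compare the next bytes against a table of signatures. sniff_segment() is a hypothetical name, not an existing toybox function. The gzip, bzip2 and xz signatures are the ones from the tar.c snippet quoted above; the two-byte lzma, lzo, lz4 and zstd values are what I believe the kernel's lib/decompress.c table uses, so please double check them before relying on this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct magic {
  const char *name;   /* label returned to the caller */
  const char *bytes;  /* signature at the start of the segment */
  size_t len;         /* number of signature bytes to compare */
};

static const struct magic magics[] = {
  { "cpio",  "070701",      6 },  /* uncompressed newc archive */
  { "gzip",  "\x1f\x8b",    2 },
  { "bzip2", "BZh",         3 },
  { "xz",    "\xfd" "7zXZ", 6 },  /* sixth byte is the literal's own NUL */
  { "lzma",  "\x5d\x00",    2 },  /* assumed from lib/decompress.c */
  { "lzo",   "\x89\x4c",    2 },  /* assumed from lib/decompress.c */
  { "lz4",   "\x02\x21",    2 },  /* assumed from lib/decompress.c */
  { "zstd",  "\x28\xb5",    2 },  /* assumed from lib/decompress.c */
};

/*
 * Hypothetical helper: skip leading NUL padding, store the number of bytes
 * skipped in *skipped, and return the name of the format that starts there.
 * Returns NULL if nothing is recognized; *skipped == len then means the
 * buffer held only padding.
 */
const char *sniff_segment(const unsigned char *buf, size_t len, size_t *skipped)
{
  size_t pos = 0, i;

  assert(skipped != NULL);
  while (pos < len && !buf[pos]) pos++;
  *skipped = pos;

  for (i = 0; i < sizeof(magics) / sizeof(*magics); i++)
    if (len - pos >= magics[i].len
        && !memcmp(buf + pos, magics[i].bytes, magics[i].len))
      return magics[i].name;

  return NULL;
}
```

This only answers "what starts here?". It does not solve the harder problem from earlier in the thread, namely knowing where a compressed segment ends so the next one can be sniffed.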

> Ok, I just checked bzcat.c and each compressed block starts with a 48 bit
> signature (with two valid, both nonzero values), so runs of zeroes can
> also be
> detected as "not valid block". Unfortunately that's reading in 4k blocks so
> you'd have to pad with a LOT of zeroes for it not to eat the start of the
> next
> chunk. (I can reduce the IOBUF_SIZE in my implementation but if it's
> calling an
> old version or some OTHER implementation when it runs "bzcat" out of the
> $PATH...) Concatenating uncompressed archives should be safe, and
> concatenating
> gzip chunks can presumably be _made_ safe, but with arbitrary archivers
> how much
> NULL padding you need is undefined, and so is what error states they'll
> exit with...
>
> Also, if we're diverging from the gnu/dammit version this far, I've had a
> todo
> item to teach my cpio to both understand and generate the kernel's
> gen_init_cpio.sh text file format for a while now. And ALSO it would be
> nice if
> there was a more conventional "recurse and make an archive from this list
> of
> files on the command line" the way tar and zip work; that's probably a new
> cpio
> -X option letter...
>
> These were all post-1.0 todo items until this can of worms got reopened. :)
>
> Rob
>

-- 

Yi-yo Chiang
Software Engineer
yochiang at google.com