[Toybox] [PATCH] cpio: support reading concatenated cpio files.

Sat Apr 17 04:47:14 PDT 2021

On 4/17/21 4:43 AM, Yi-yo Chiang wrote:
> On Sat, Apr 17, 2021 at 2:56 PM Rob Landley <rob at landley.net
> <mailto:rob at landley.net>> wrote:
> 
>     On 4/16/21 1:44 PM, Yi-yo Chiang wrote:
>     > I'm not sure what Elliot's goal is? I assume he's trying to extract a
>     > concatenated ramdisk, and I still see a problem in the current solution. 
>     >
>     > The buffer-format
>     >
>     (https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt)
>     says:
>     >
>     >   initramfs  := ("\0" | cpio_archive | cpio_gzip_archive)*
>     >
>     > In other words, both `cat a.cpio b.cpio >merged.cpio` and `(cat a.cpio && echo
>     > -n -e '\0\0\0' && cat b.cpio) >merged.cpio` are valid initramfs.
> 
>     It also implies that two compressed files can be concatenated and separated by
>     arbitrary runs of nulls, or you can have a compressed file and a non-compressed
>     file concatenated, or...
> 
> 
> Correct. Upon further inspection, it's actually "arbitrary NULLs could prepend a
> GZIP(cpio_archive)",

I'm not currently handling that case, and I'm not sure where the right place to
handle it is. (Should gzip handle it, or should cpio call out to gzip?)

And then you have to care that the _decompressor_ stops gracefully at the end of
its compressed data and isn't reading/discarding extra from its input...

> "arbitrary 4-aligned NULLs prepend an *uncompressed*
> cpio_archive"

It should be handling this case now.

> and "cpio_file/cpio_trailer within a cpio_archive have to be
> 4-aligned with arbitrary NULLs". initramfs.c seems to try very hard to respect
> the alignment requirement, but I guess we could just skip *ANY* extra NULLs for
> simplicity?

It was already 4-aligned. That's part of the file specification. Padding with
_more_ than that was throwing it off, though. Should handle it now?
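
A minimal sketch of the "skip extra NUL padding before the next archive member"
idea, not the actual toybox code. The function name next_cpio_header() and its
return convention are hypothetical, and it assumes the uncompressed newc format
with the "070701" magic only (the "070702" CRC variant is not handled):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: find the next newc header at or after pos.
// Skips any run of NUL bytes first. Returns the header offset, len if
// only NUL bytes remain, or (size_t)-1 if the next non-NUL data does
// not start with the newc magic.
size_t next_cpio_header(const unsigned char *buf, size_t len, size_t pos)
{
  assert(buf && pos <= len);

  // Accept padding of any length, not just up to the next multiple of 4.
  while (pos < len && !buf[pos]) pos++;
  if (pos == len) return len;

  if (len - pos >= 6 && !memcmp(buf + pos, "070701", 6)) return pos;

  return (size_t)-1;
}
```

Called on a buffer holding five NUL bytes followed by "070701", this returns 5.
Treating every run of NULs the same way keeps 4-byte and 512-byte padded
archives on one code path.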

(Let me know what other tests I should add to tests/cpio.test.)

>     Grrr. I need to test this. And possibly genericize the tar.c code to detect
>     compression type and run it through a decompressor so cpio can do it too...
> 
> 
> Sounds like another can of worms... :/

Indeed. Haven't started that yet because tar.c is already doing it and I want to
factor out common code from that, ala:

        // detect gzip and bzip signatures
        if (SWAP_BE16(*(short *)hdr)==0x1f8b) toys.optflags |= FLAG_z;
        else if (!memcmp(hdr, "BZh", 3)) toys.optflags |= FLAG_j;
        else if (peek_be(hdr, 7) == 0xfd377a585a0000UL) toys.optflags |= FLAG_J;
        else error_exit("Not tar");
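
For illustration only, here is roughly what a factored-out version of that
check could look like as a standalone function. The enum and the name
detect_compression() are assumptions made for this sketch, not existing toybox
API. It compares the 6-byte xz magic rather than the 7 bytes used in the
snippet above, and it checks the buffer length so short reads are safe:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical result type, not part of toybox.
enum comp_type { COMP_NONE, COMP_GZIP, COMP_BZIP2, COMP_XZ };

// Classify the start of a buffer by its magic number.
enum comp_type detect_compression(const unsigned char *hdr, size_t len)
{
  assert(hdr || !len);

  // gzip members start with the bytes 0x1f 0x8b
  if (len >= 2 && hdr[0] == 0x1f && hdr[1] == 0x8b) return COMP_GZIP;
  // bzip2 streams start with "BZh" followed by a block size digit
  if (len >= 3 && !memcmp(hdr, "BZh", 3)) return COMP_BZIP2;
  // xz streams start with 0xfd '7' 'z' 'X' 'Z' 0x00
  if (len >= 6 && !memcmp(hdr, "\xfd" "7zXZ\0", 6)) return COMP_XZ;

  return COMP_NONE;
}
```

Returning a type instead of setting toys.optflags directly would let both tar
and cpio decide for themselves what to do with the answer, including treating
COMP_NONE as "try it as an uncompressed archive" instead of an error.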

> The buffer-format.txt seems to be a bit outdated, as Linux now supports a lot of
> compression types besides gzip, all of which are configurable
> (https://elixir.bootlin.com/linux/latest/source/lib/decompress.c#L52). So the
> initramfs grammar implemented by initramfs.c is in reality:
> 
>   initramfs  := ("\0" | cpio_archive | compressed_cpio_archive)*
>   compressed_cpio_archive := CONFIG_COMPRESSION_ALGORITHM(cpio_archive)
>   CONFIG_COMPRESSION_ALGORITHM := GZIP | BZIP2 | LZMA | XZ | LZO | LZ4 | ZSTD
> 
> where the exact set of compression algorithms is decided by the kernel config.

Exactly. Toybox knows about gzip, bzip, and xz. (The only compressor I currently
plan to natively support is gzip, but it has decompressors for the other two.
The xz one is a bit stale and still in pending and needs serious cleanup, but
was sourced from public domain code.)

I can add more, but it hadn't previously come up?

Also, I'm really fuzzy on the difference between xz/lzma/lzo/lz4/zstd.

>     > btw gen_init_cpio.c also pads initramfs to 512-byte boundary
>     >
>     (https://github.com/torvalds/linux/blob/6fbd6cf85a3be127454a1ad58525a3adcf8612ab/usr/gen_init_cpio.c#L97)
> 
>     *blink* *blink* Why...? (cpio doesn't have a 512 stride in the file format? It
>     has a 4-byte stride for padding strings with NUL bytes, but that's about it?)
> 
>     > If we're viewing buffer-format.txt as the "right" cpio spec, then I think we
>     > should implement this too. We should skip arbitrary extra NUL-bytes padded
>     > between cpio file frames
> 
>     Skipping arbitrary extra null bytes at the start is easy enough to do. I guess
>     the hardwired trailing read was expecting the 512 padding...
> 
>     I'm gonna need to add a _lot_ more test suite entries for this command.
> 
>     Ok, skip arbitrary leading NUL bytes after each entry, pad last record to 512
>     byte alignment with NUL bytes, autodetect compression type at each record start,
>     implement hardlinks and have TRAILER!!! flush hardlink context...
> 
> 
> I'm not so sure about padding the last entry to 512-byte boundary. 512 looks
> like a random value to me? (Or an implementation detail of GNU cpio and
> gen_init_cpio). Nonetheless I think we should pad the last record to 4-byte
> boundary, so that both
> 
>   cat a.cpio.gz b.cpio.gz >c.cpio.gz

It's been padding it to a 4 byte boundary all along, that's what those trailing
4 NULs on TRAILER!!! were for. (The first is the null terminator for the string,
the other 3 are padding for alignment: 110+10+1+3=124 which is 31*4.)
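
The same arithmetic as a small sketch, in case it helps when writing tests.
The function names are made up for this example. Note that namelen here
excludes the terminating NUL, unlike the namesize field stored in the header:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// newc header: 6-byte magic plus 13 fields of 8 hex digits each.
#define NEWC_HEADER_LEN 110

// Size of a newc header plus its NUL-terminated name, rounded up to a
// multiple of 4. namelen does not include the terminating NUL.
size_t newc_header_and_name_size(size_t namelen)
{
  size_t raw = NEWC_HEADER_LEN + namelen + 1;

  return (raw + 3) & ~(size_t)3;
}

// The trailer entry has no file data, so this is the whole record.
size_t trailer_record_size(void)
{
  size_t n = newc_header_and_name_size(strlen("TRAILER!!!"));

  assert(n % 4 == 0);
  return n;
}
```

trailer_record_size() comes out to 124: 110 bytes of header, 10 bytes of name,
1 terminator, and 3 bytes of padding.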

> and
>   
>   zcat a.cpio.gz b.cpio.gz >c.cpio
> 
> are valid initramfs/cpio? 

That's the headache part: should zcat understand that sort of concatenation? The
gnu/dammit cpio implementation doesn't call out to compressors, you MUST do it
in a pipeline. And even if zcat understood concatenation with NUL bytes, you can
glue an .xz file to a .gz. Which tool hands off to which tool and when does
control come BACK... (At least gz has little signatures at the start of blocks
so runs of NUL bytes can be detected as invalid. Don't remember what bz2 does
off the top of my head, and never learned what xz does...)

I can teach my cpio to call out to decompressors, but this is new design that
needs to be thought through. Does it automatically do it, is there a new flag?
Is this decompression side only and the compression side still needs its output
piped?

Ok, I just checked bzcat.c and each compressed block starts with a 48 bit
signature (with two valid, both nonzero values), so runs of zeroes can also be
detected as "not valid block". Unfortunately that's reading in 4k blocks so
you'd have to pad with a LOT of zeroes for it not to eat the start of the next
chunk. (I can reduce the IOBUF_SIZE in my implementation but if it's calling an
old version or some OTHER implementation when it runs "bzcat" out of the
$PATH...) Concatenating uncompressed archives should be safe, and concatenating
gzip chunks can presumably be _made_ safe, but with arbitrary archivers how much
NULL padding you need is undefined, and so is what error states they'll exit with...
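
A rough sketch of that signature check, to show why a run of zeroes can't be
mistaken for a block. This is not the bzcat.c code: the helper names are
hypothetical, and it assumes the six bytes are byte-aligned, which only holds
for the first block right after the 4-byte "BZh" stream header (later blocks
are bit-aligned):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BZ_BLOCK_MAGIC 0x314159265359ULL  // start of a compressed block
#define BZ_EOS_MAGIC   0x177245385090ULL  // end-of-stream marker

// Read a big-endian 48-bit value from six bytes.
uint64_t peek_be48(const unsigned char *p)
{
  uint64_t v = 0;

  assert(p);
  for (int i = 0; i < 6; i++) v = (v << 8) | p[i];

  return v;
}

// Returns 1 for a block start, 2 for end of stream, and 0 for anything
// else, which includes six NUL bytes.
int bz_signature_kind(const unsigned char *p)
{
  uint64_t sig = peek_be48(p);

  if (sig == BZ_BLOCK_MAGIC) return 1;
  if (sig == BZ_EOS_MAGIC) return 2;

  return 0;
}
```

Detecting the padding is the easy half. The read-ahead problem described above
is separate: by the time the signature check fails, the decompressor may
already have consumed input that belonged to the next archive.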

Also, if we're diverging from the gnu/dammit version this far, I've had a todo
item to teach my cpio to both understand and generate the kernel's
gen_init_cpio.sh text file format for a while now. And ALSO it would be nice if
there was a more conventional "recurse and make an archive from this list of
files on the command line" the way tar and zip work; that's probably a new cpio
-X option letter...

These were all post-1.0 todo items until this can of worms got reopened. :)

Rob