<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Apr 18, 2021 at 4:59 AM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 4/17/21 8:40 AM, Yi-yo Chiang wrote:<br>
> On Sat, Apr 17, 2021 at 7:32 PM Rob Landley <<a href="mailto:rob@landley.net" target="_blank">rob@landley.net</a>> wrote:<br>
> <br>
> On 4/17/21 4:43 AM, Yi-yo Chiang wrote:<br>
> > On Sat, Apr 17, 2021 at 2:56 PM Rob Landley <<a href="mailto:rob@landley.net" target="_blank">rob@landley.net</a>> wrote:<br>
> ><br>
> > On 4/16/21 1:44 PM, Yi-yo Chiang wrote:<br>
> > > I'm not sure what Elliot's goal is? I assume he's trying to extract a<br>
> > > concatenated ramdisk, and I still see a problem in the current solution.<br>
> > ><br>
> > > The buffer-format<br>
> > > (<a href="https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt" rel="noreferrer" target="_blank" class="cremed">https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt</a>)<br>
> > > says:<br>
> > ><br>
> > > initramfs := ("\0" | cpio_archive | cpio_gzip_archive)*<br>
> > ><br>
> > > In other words, both `cat a.cpio b.cpio >merged.cpio` and<br>
> > > `(cat a.cpio && echo -n -e '\0\0\0' && cat b.cpio) >merged.cpio` are valid initramfs.<br>
> ><br>
> > It also implies that two compressed files can be concatenated and separated by<br>
> > arbitrary runs of nulls, or you can have a compressed file and a non-compressed<br>
> > file concatenated, or...<br>
> ><br>
> ><br>
> > Correct. Upon further inspection, it's actually "arbitrary NULLs could prepend a<br>
> > GZIP(cpio_archive)",<br>
> <br>
> I'm not currently handling that case, and I'm not sure where is the right place<br>
> to handle it? (Should gzip handle it, or should cpio call out to gzip?)<br>
> <br>
> And then you have to care that the _decompressor_ stops gracefully at the end of<br>
> its compressed data and isn't reading/discarding extra from its input...<br>
> <br>
> <br>
> I just read more into the kernel initramfs.c and decompressor_*.c, and it seems<br>
> like even the kernel doesn't handle this all that well.<br>
> For example, the gzip decompressor (inflate) stops gracefully at the end of<br>
> compressed data, but lz4 decompressor doesn't and errors when there is data past<br>
> the end of compressed data.<br>
<br>
It's possible to make this work right, but not _easy_ to do so, because of the<br>
read buffer issues.<br>
<br>
> Back to the original question, I think handling concatenated uncompressed cpio<br>
> is good enough.<br>
<br>
In theory that's in now.<br>
<br>
> I can teach my cpio to call out to decompressors, but this is new design that<br>
> needs to be thought through. Does it automatically do it, is there a new flag?<br>
> Is this decompression side only and the compression side still needs its output<br>
> piped?<br>
> <br>
> I highly doubt zcat supports initramfs-style concatenated .gz.<br>
<br>
I wrote my own lib/deflate.c from scratch (keep meaning to finish the compressor<br>
side but my todo list runneth over and toybox is not my day job), so I'm pretty<br>
sure I can make it handle multiple concatenated files.<br>
<br>
And I vaguely recall that zlib's version of handling non-compressed data was to<br>
send it through to the output verbatim. (Which means if you have a tarball<br>
containing a gzip file things could get ugly.)<br>
<br>
> AFAICT, in order<br>
> to deal with "(cat a.cpio.gz && echo -n -e '\0\0' && cat<br>
> b.cpio.gz)>initramfs.img", right now we need to use tools such as binwalk && dd<br>
> to slice the initramfs.img into its individual components, and then pipe the<br>
> sliced chunks into zcat, lz4cat ... whatever-cat. It sure sounds useful for cpio<br>
> to have an option or flag (like tar) to let it auto detect the compression<br>
> method and call the compression library.<br>
<br>
Toybox tar doesn't use compression libraries for this, it forks another process<br>
and feeds data through a pipe. (Which gets us SMP automatically, and means we<br>
can use compression types we don't internally implement.) That said, I could<br>
easily add a --showsize option that prints the number of bytes of input consumed<br>
to the three compressors toybox implements. (Not that this is useful because I<br>
don't _require_ using the toybox compressors/decompressors, for interoperability<br>
reasons, so can't depend on a feature I'd add.)<br>
<br>
The problem isn't figuring out where the data _starts_, it's figuring out where<br>
it _ends_. Long ago I had a design for a parallel bzip2 decompressor that would<br>
search ahead in the code for the next bzip2 start of block signature and<br>
dispatch each chunk to a thread pool (and then only keep the results when the<br>
previous block said "we ended here" and that was one of the starting points;<br>
ones that got bypassed because false positive in the middle of a block would<br>
just have their output discarded).<br>
<br>
But at this point bzip2 is obsolete enough I'm unlikely to bother even when I<br>
make it that far down on my todo list. That said, it shouldn't be too hard to do<br>
something similar for gzip? (The start of block has to be byte aligned.) And<br>
_specifically_ this would be "figure out where the next compression start is,<br>
keep the last X kilobytes of data we fed to the decompressor pipe, and start the<br>
next one at the last compressor start signature match near the ending point<br>
where the decompressor gave up".<br>
<br>
Which still doesn't help you with _decompressed_ data between compressed runs...<br>
<br>
What might be useful is to special case gzip, and handle that with the internal<br>
deflate code, which can accurately measure the number of bytes consumed and<br>
preserve any data after that. And then just say that's the ONLY compression type<br>
that concatenation works for.<br>
<br>
I could also just teach gzip that you can concatenate .gz files, </blockquote><div><br></div><div>This has always been the case. AFAIK this is an explicit feature of gzip.</div><div><br></div><div>This:</div><div> gzip -c a >a.gz</div><div> gzip -c b >b.gz</div><div> cat a.gz b.gz > ab.gz</div><div> zcat ab.gz >ab</div><div>and this:</div><div> cat a b >ab</div><div>have the same effect.</div><div> </div><div>Thus supporting concatenated .cpio.gz is as simple as:</div><div> zcat a.cpio.gz b.cpio.gz | cpio -i</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">and say we<br>
support concatenating file.cpio.gz or file.cpio but don't mix them.<br></blockquote><div><br></div><div>Concatenating .cpio.gz or .cpio without mixing is already supported (with the latest "go past TRAILER!!!" change, cheers!). The problematic case is when NUL padding is thrown into the mix. I think calling the native inflate code to determine the start and end offsets of the compressed region sounds good, as DEFLATE seems to encode an explicit end marker, so the end offset is well-defined. LZ4, on the other hand, does not: the legacy lz4 format used by the kernel relies on EOF to mark the end of the compressed stream, which makes it difficult to determine the actual boundary when we are concatenating compressed and uncompressed data. I don't know about other formats.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
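</blockquote><div>To illustrate the inflate-based idea, here is a minimal Python sketch (not toybox code) that walks a buffer made of gzip members and runs of NUL padding. It assumes Python's zlib module, whose decompress object stops at the end of a gzip member and leaves the remaining bytes in unused_data, so the end offset falls out directly. The name split_segments is made up for this example, and uncompressed cpio and other compressors are deliberately not handled.</div>
<pre>
```python
import gzip
import zlib


def split_segments(blob):
    """Return (kind, start, end) tuples covering blob.

    kind is "pad" for a run of NUL bytes or "gzip" for one gzip member.
    Anything else raises ValueError; this sketch does not parse raw cpio.
    """
    segments = []
    pos = 0
    while pos != len(blob):
        if blob[pos] == 0:
            # Skip a run of NUL padding between archives.
            end = pos
            while end != len(blob) and blob[end] == 0:
                end += 1
            segments.append(("pad", pos, end))
        elif blob[pos:pos + 2] == b"\x1f\x8b":
            # wbits of 16 + MAX_WBITS tells zlib to expect a gzip wrapper.
            inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
            inflater.decompress(blob[pos:])
            if not inflater.eof:
                raise ValueError("truncated gzip member at offset %d" % pos)
            # Bytes zlib did not consume belong to whatever follows.
            end = len(blob) - len(inflater.unused_data)
            segments.append(("gzip", pos, end))
        else:
            raise ValueError("unrecognized data at offset %d" % pos)
        pos = end
    return segments


if __name__ == "__main__":
    first = gzip.compress(b"first archive")
    second = gzip.compress(b"second archive")
    for segment in split_segments(first + b"\0\0\0" + second):
        print(segment)
```
</pre>
<div>Each "gzip" slice blob[start:end] could then be decompressed and handed to cpio on its own. The same approach would not work for the legacy lz4 format, since there is no end marker for the decompressor to stop at.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">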
<br>
Rob<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Yi-yo Chiang<br>Software Engineer<br><a href="mailto:yochiang@google.com" target="_blank">yochiang@google.com</a></div></div></div>