[Toybox] archivers

Rob Landley rob at landley.net
Sun Dec 1 11:50:40 PST 2013


This is quite possibly stale, but just in case, lemme finish this  
half-reply from September:

On 09/13/2013 12:42:19 AM, Isaac wrote:
> On Mon, Aug 12, 2013 at 12:57:48PM -0500, Rob Landley wrote:
> > On 08/12/2013 12:30:05 AM, Isaac wrote:
> > >On Sun, Aug 11, 2013 at 10:07:49PM -0500, Rob Landley wrote:
> > >> On 08/11/2013 07:42:37 PM, Isaac wrote:
> > >> >I also have some code here that *should* handle writing a member
> > >> >of a cpio
> > >> >newc archive, loopfiles_stdin(), and a general idea of how to
> > >proceed.
> > >> >Don't expect much soon.
> 
> Right now I have code that will write a newc archive,
> but nothing to read it.

Will the kernel's initramfs stuff extract it? That's the biggest  
real-world user.

There's also the stanza I put in ramfs-rootfs-initramfs.txt in the  
kernel documentation:

   cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames

> (I note that loopfiles_stdin() is currently not as robust as is  
> desirable;
> it handles at most 4094-char lines because it fgets() into toybuf,
> needs a null terminator and a newline, and removes the newline.
> If there's no newline in a 4095-byte read, it ignores the line
> and starts munching bytes until it hits a \n or the end of file.
> But for regular systems where you rarely have a path
> more than 80 chars long, it's... usable.)

Yeah, that needs fixing. Although PATH_MAX is 4096 on Linux (and
despite the openat() stuff, PATH_MAX is still in use in some places).
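Something like getline() would handle arbitrary lengths, since it
reallocates the buffer itself. Untested sketch, not matched to the
actual loopfiles_stdin() signature:

#include <stdio.h>
#include <stdlib.h>

// Feed each stdin line (any length) to a callback. getline() grows
// the buffer as needed, so no 4094-char ceiling and no resync state
// munching bytes after a too-long line.
void loopfiles_stdin_sketch(void (*function)(char *name))
{
  char *line = 0;
  size_t size = 0;
  ssize_t len;

  while ((len = getline(&line, &size, stdin)) != -1) {
    if (len && line[len-1] == '\n') line[--len] = 0;  // strip newline
    if (len) function(line);                          // skip blank lines
  }
  free(line);
}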

> My big question is how archive readers should handle reading archives,
> in terms of API.
> I'm hoping that the reader for cpio can be reused in tar.

Same here, but I also think zip is important because lots of things
(jar files, docx files, chrome extensions...) are renamed zip files.

I also want to write gene2fs as an archiver, which implies doing the  
same for a fat filesystem generator/extractor...

> Some general thoughts:
> 
> read_archiveformat() will take a file descriptor, and also needs  
> these:
> 
> -a list of files to extract (recursively? should be an option);
>  if NULL, extract all.
> 
> -a list to not extract;
>  if NULL, extract all.  This supersedes the list to extract.
> 
> When I say "list", I mean it loosely; any sort of data structure
> that can hold multiple records might work.
> But these lists are going to come straight from the command line,
> so a linked list is probably what the toy has to start with...

Let's do the first couple of them the simple way and then worry about  
common code after we have some commonalities to merge.
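Although for the lists, toybox's option parsing already collects
repeated arguments into struct arg_list, so the filter could probably
just walk two of those. Strawman (filter_name() is a made-up name):

#include <string.h>

// As in lib/lib.h: a singly linked list of command line strings.
struct arg_list {struct arg_list *next; char *arg;};

// Exclude supersedes include; a NULL include list means "everything".
int filter_name(char *name, struct arg_list *inc, struct arg_list *exc)
{
  struct arg_list *al;

  for (al = exc; al; al = al->next) if (!strcmp(name, al->arg)) return 0;
  if (!inc) return 1;
  for (al = inc; al; al = al->next) if (!strcmp(name, al->arg)) return 1;

  return 0;
}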

Also, some file formats are seekable (zip, ext2) and some aren't (tar,  
cpio). If an mtools-like "dir fat.img:path/to/file" winds up being of  
interest, making it share code with a fat extractor/generator might be  
easier if it wasn't built around tar's streaming assumptions. (or not,  
dunno yet...)

> -int mode: bitwise OR of WRITE = 1, VERBOSE = 2, CHANGE_OWNER = 4, ...
>  listing files (tar -t / cpio -t) needs to be implemented with read()
> rather than lseek() to get from one header to the next, due to the
> likelihood of getting a stream.  So the simple approach is to use
> the same function call for both, with the call to write() conditional.
> This also allows verbose extraction fairly easily.
> 
> CHANGE_OWNER will use the recorded owner:group, if chown() is  
> permitted.
> 
> Is this reasonable?
> Are there any points I may be missing?

I don't know. I haven't written tar and zip yet. :)

It sounds reasonable?
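If it shakes out the way I'd picture it, the mode bits turn into one
loop. Untested sketch, with parse_header(), extract_data(), and
skip_data() standing in for format-specific code that doesn't exist
yet:

#include <stdio.h>
#include <unistd.h>

#define WRITE        1
#define VERBOSE      2
#define CHANGE_OWNER 4

struct file_header {char *name; unsigned mode, uid, gid; unsigned long size;};

int parse_header(int fd, struct file_header *hdr);  // 0 at end of archive
int extract_data(int fd, struct file_header *hdr);  // create file, copy data
void skip_data(int fd, struct file_header *hdr);    // read() past the data

// One loop serves -t, -v, and extraction: always advance with read()
// (the input could be a pipe), and make the writing parts conditional.
void read_archive(int fd, int mode)
{
  struct file_header hdr;

  while (parse_header(fd, &hdr)) {
    if (mode & VERBOSE) puts(hdr.name);
    if (mode & WRITE) {
      int ofd = extract_data(fd, &hdr);

      // use the recorded owner:group where chown() is permitted
      if ((mode & CHANGE_OWNER) && ofd != -1) fchown(ofd, hdr.uid, hdr.gid);
      if (ofd != -1) close(ofd);
    } else skip_data(fd, &hdr);
  }
}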

I sort of had this problem space in mind when I did the dirtree stuff.  
In the case of genext2 you have to output the metadata before the file  
contents, and you need to know block offsets if you're doing it in a  
streaming fashion. So you have to scan the directories ahead of time to  
get all the file sizes (and then if files change while you're
traversing, you have to pad or truncate them to match the metadata
snapshot). There's a bit of this in rsync too, where it compares
metadata before comparing file contents, and they can change between
the two passes...
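Meaning the data copy has to trust the earlier stat() instead of the
file's current length. Sketch, error handling elided:

#include <string.h>
#include <unistd.h>

// Copy exactly "size" bytes (the length from the metadata pass) from
// fd to out: stop early if the file grew, pad with zeroes if it
// shrank. Otherwise the block offsets already written are wrong.
void copy_snapshot(int fd, int out, off_t size)
{
  char buf[4096];

  while (size > 0) {
    off_t chunk = size < (off_t)sizeof(buf) ? size : (off_t)sizeof(buf);
    ssize_t len = read(fd, buf, chunk);

    if (len <= 0) {  // hit EOF early, so the file shrank: pad it out
      memset(buf, 0, chunk);
      len = chunk;
    }
    write(out, buf, len);
    size -= len;
  }
  // anything past "size" still in the file just never gets read
}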

I was thinking of wrapping archive creation around dirtree. (And  
possibly archive extraction too if it has to keep track of what  
directory it's currently in.) But I haven't actually done the work yet,  
and concrete beats speculation...

> > >> >(newc is essentially an octal dump of struct stat; oldc differs
> s/octal/hexadecimal/ for newc. oldc/odc is octal.
> 
> > >> >only in
> > >> >having a few fields missing or differently sized. cpio is
> > >> >well-suited to
> > >> >loopfiles, except it reads from stdin.
> > >>
> > >> I've only been looking at newc, because that's what initramfs uses.
> > >> If RPM (the other major modern user of cpio I'm aware of) also uses
> > >> newc, then that's probably all we need to implement. (Dunno about
> > >> full rpm support, but we should be able to do rpm2cpio.)
> > >
> > >http://people.freebsd.org/~kientzle/libarchive/man/cpio.5.txt
> > >https://github.com/ereshetova/rpm/blob/master/lib/cpio.h
> > >
> > >Yes, that's what RPM uses.
> > >mksh used to be distributed in cpio odc format, which really doesn't
> > >take all that different code. They use some tar format now.
> >
> > It's like patch only doing -u: there are files out there that
> > _aren't_ in that format, but there are other tools if you really
> > need that.
> >
> > >By the way: for extracting CPIO archives, if nlink > 1, check
> > >devmajor, devminor, and ino. If they have not been recorded, record
> > >and extract; if they have, create a hard link.
> 
> Actually, it's record the file and re-extract every time.
> The last record with a given device/inode overwrites all previous ones,
> so setting size to 0 for everything before that is OK.
> 
> <snip a bad approach>

Um... ok?

I need to break down and do a hash table or balanced tree or something
to store inodes in. The kernel guys have finally threatened to make
them 64 bit entities everywhere, so a 4k page can store 512 of them,
and sorting a 512 entry array is cheap (heck, just insert it at the
right place and move everything else down each time). So: a tree of
sorted 4k blocks that splits into a pair of child nodes when it fills
up, and just ignore balancing for now because the block size prevents
the tree from getting too deep in anything _like_ a common case...

Something simple like that as a first pass, maybe?
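Roughly like this, maybe (untested sketch, all names and sizes
invented; interior blocks waste their array space once they split,
but first pass):

#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

// One entry per dev/inode pair we've seen, plus where we first put it.
struct ikey {dev_t dev; ino_t ino; char *name;};

// A block of sorted keys, about a page with 24-byte entries on 64 bit.
// Odd count so a split leaves a median plus two equal halves.
#define PERBLOCK 169

struct iblock {
  struct iblock *child[2];  // only set once this block has split
  int count;
  struct ikey key[PERBLOCK];
};

static int icmp(struct ikey *a, struct ikey *b)
{
  if (a->dev != b->dev) return a->dev < b->dev ? -1 : 1;
  if (a->ino != b->ino) return a->ino < b->ino ? -1 : 1;

  return 0;
}

// Return the recorded name if we've seen this dev/inode before (so the
// caller can link() to it), else record it and return NULL. Allocation
// failure handling elided.
char *seen_inode(struct iblock **root, struct ikey *key)
{
  struct iblock *blk = *root;

  if (!blk) *root = blk = calloc(1, sizeof(*blk));
  for (;;) {
    int lo = 0, hi = blk->count;

    // binary search for this key's position in the sorted array
    while (lo < hi) {
      int mid = (lo+hi)/2, c = icmp(key, blk->key+mid);

      if (!c) return blk->key[mid].name;
      if (c < 0) hi = mid;
      else lo = mid+1;
    }
    // interior block: only the median left here, descend to a child
    if (blk->child[0]) {
      blk = blk->child[!!lo];
      continue;
    }
    // full leaf: split into two children around the median, then retry
    if (blk->count == PERBLOCK) {
      int i;

      for (i = 0; i < 2; i++) {
        blk->child[i] = calloc(1, sizeof(*blk));
        blk->child[i]->count = PERBLOCK/2;
      }
      memcpy(blk->child[0]->key, blk->key, (PERBLOCK/2)*sizeof(struct ikey));
      memcpy(blk->child[1]->key, blk->key+PERBLOCK/2+1,
             (PERBLOCK/2)*sizeof(struct ikey));
      blk->key[0] = blk->key[PERBLOCK/2];
      blk->count = 1;
      continue;
    }
    // room here: shift the tail down one and insert in sorted order
    memmove(blk->key+lo+1, blk->key+lo, (blk->count-lo)*sizeof(struct ikey));
    blk->key[lo] = *key;
    blk->count++;

    return 0;
  }
}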

> > Sigh. I've been meaning to do some variant of a really simple hash
> > table or balanced tree. (More likely a tree.) Tar needs hardlink
> > checking too, and the way I want to write gene2fs that's an archiver
> <snip>
> > Anyway; I'm uncomfortable with an n^2 data storage/lookup mechanism
> > on an unbounded data set. Try to cpio a gigabyte directory and it
> > could get really slow. (Hmmm... but if it only cares about
> > link>1...)
> If nlink = 1, there's no reason to treat the file specially.
> 
> > >int orig_ino;
> > >int nlink; // decrement by 1 on every write matching file?
> > >dev_t orig_dev; //devmajor + devminor on newc/"SVR4"
> > >long new_ino;
> > >dev_t new_dev;
> 
> Looking over things again, I see that there might be a call for
> including the fd in the table.
> Or that may be a hare-brained idea.

Keep in mind your average process has a 1024 open filehandle limit, and  
for archivers that's a real limiting factor.
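(Easy enough to check what a process actually gets:)

#include <stdio.h>
#include <sys/resource.h>

// Print this process's open file limit; the usual Linux soft default
// is 1024.
int main(void)
{
  struct rlimit rl;

  getrlimit(RLIMIT_NOFILE, &rl);
  printf("soft %lu hard %lu\n",
         (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

  return 0;
}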

> Thanks,
> Isaac Dunham

Rob

