[Toybox] archivers

Isaac ibid.ag at gmail.com
Thu Sep 12 22:42:19 PDT 2013


On Mon, Aug 12, 2013 at 12:57:48PM -0500, Rob Landley wrote:
> On 08/12/2013 12:30:05 AM, Isaac wrote:
> >On Sun, Aug 11, 2013 at 10:07:49PM -0500, Rob Landley wrote:
> >> On 08/11/2013 07:42:37 PM, Isaac wrote:
> >> >I also have some code here that *should* handle writing a member
> >> >of a cpio
> >> >newc archive, loopfiles_stdin(), and a general idea of how to
> >proceed.
> >> >Don't expect much soon.

Right now I have code that will write a newc archive,
but nothing to read it.
(I note that loopfiles_stdin() is currently not as robust as is desirable; 
it handles at most 4094-char lines because it fgets() into toybuf,
needs a null terminator and a newline, and removes the newline.
If there's no newline in a 4095-byte read, it ignores the line
and starts munching bytes until it hits a \n or the end of file.
But for regular systems where you rarely have a path 
more than 80 chars long, it's...useable.)

My big question is how archive readers should handle reading archives, 
in terms of API.
I'm hoping that the reader for cpio can be reused in tar.
Some general thoughts:

read_archiveformat() will take a file descriptor, and also needs these:

-a list of files to extract (recursively? should be an option);
 if NULL, extract all.

-a list to not extract;
 if NULL, extract all.  This supersedes the list to extract.

When I say "list", I mean it loosely; any sort of data structure
that can hold multiple records might work.
But these lists are going to come straight from the command line,
so a linked list is probably what the toy has to start with...

-int mode: bitwise OR of WRITE = 1, VERBOSE = 2, CHANGE_OWNER = 4, ...
 listing files (tar -t / cpio -t) needs to be implemented with read() 
rather than lseek() to get from one header to the next, due to the 
likelyhood of getting a stream.  So the simple approach is to use
the same function call for both, with the call to write() conditional.
This also allows verbose extraction fairly easily.
CHANGE_OWNER will use the recorded owner:group, if chown() is permitted.

Is this reasonable?
Are there any points I may be missing?

> >> >(newc is essentially an octal dump of struct stat; oldc differs
s/octal/hexadecimal/ for newc. oldc/odc is octal.

> >> >only in
> >> >having a few fields missing or differently sized. cpio is
> >> >well-suited to
> >> >loopfiles, except it reads from stdin.
> >>
> >> I've only been looking at newc, because that's what initramfs uses.
> >> If RPM (the other major modern user of cpio I'm aware of) also uses
> >> newc, then that's probably all we need to implement. (Dunno about
> >> full rpm support, but we should be able to do rpm2cpio.)
> >
> >http://people.freebsd.org/~kientzle/libarchive/man/cpio.5.txt
> >https://github.com/ereshetova/rpm/blob/master/lib/cpio.h
> >
> >Yes, that's what RPM uses.
> >mksh used to be distributed in cpio odc format, which really doesn't
> >take all that different code. They use some tar format now.
> 
> It's like patch only doing -u: there are files out there that
> _aren't_ in that format, but there are other tools if you really
> need that.
> 
> >By the way: for extracting CPIO archives, if nlink > 1, check
> >devmajor,
> >devminor, and ino. If they have not been recorded, record and extract;
> >if they have, create a hard link.

Actually, it's record the file annd re-extract every time.
The last record with a given device/inode overwrites all previous ones,
so setting size to 0 for everything before that is OK.

<snip a bad approach>

> Sigh. I've been meaning to do some variant of a really simple hash
> table or balanced tree. (More likely a tree.) Tar needs hardlink
> checking too, and the way I want to write gene2fs that's an archiver
<snip> 
> Anyway; I'm uncomfortable with an n^2 data storage/lookup mechanism
> on an unbounded data set. Try to cpio a gigabyte directory and it
> could get really slow. (Hmmm... but if it only cares about
> link>1...)
If nlink = 1, there's no reason to treat the file special.

> >int orig_ino;
> >int nlink; // decrement by 1 on every write matching file?
> >dev_t orig_dev; //devmajor + devminor on newc/"SVR4"
> >long new_ino;
> >dev_t new_dev;

Looking over things again, I see that there might be a call for 
including the fd in the table.
Or that may be a hair-brained idea.

Thanks,
Isaac Dunham

 1379050939.0


More information about the Toybox mailing list