[Toybox] [New Toys] - fstype, blkid

Wed Oct 9 14:06:11 PDT 2013

On 10/09/2013 02:40:07 AM, Conroy, Bradley Quentin wrote:
> Thanks for the feedback Rob.
> 
> > Allow me to introduce you to aboriginal linux system images. Go to:
> 
> I've used it before with x86 qemu, I'll have to build qemu v1.5
> 
> > My limiting factor here is actually lack of test filesystem images.
> 
> That was mine too, but now I have ext{2,3,4}, f2fs, minix, msdos,  
> ntfs,
> reiser3, squash{3,4} and vfat.  They compress down to a few kb with
> bz2.

If you could send me a copy of those, I would appreciate it.

> >Ok, convert tabs to two spaces and check that in.
> 
> done

Ah, sorry, should have clarified: this was a running log of what I was  
doing to my copy, so I could link the post from  
http://landley.net/aboriginal/cleanup.html once I get around to  
updating that again.

(On my local todo lists that update is the next todo dependent on  
finishing the ifconfig cleanup which means A) researching IPv6, B)  
researching the portableish functions if_nametoindex and if_nameindex  
to see if it should be using them instead of groveling around in /proc  
directly, C) working out what to do about infiniband addresses not  
fitting in the standard return structure, given that I haven't got an  
infiniband test environment.)

(Yeah, my todo dependencies are weird. I get blocked on strange stuff  
sometimes...)

> >Ok, yank the typedef. Make function definitions match K&R like
> >everybody else for the past 30 years, ala:
> 
> Damn, that's how I had it to begin with.
> 
> >You don't need #if CFG_BLKID because blkid.c only gets compiled if
> >CFG_BLKID is enabled. (If the name of a *.c file under toys/ matches
> >the name of a config symbol, the C file's inclusion is controlled by
> >that config symbol.)
> 
> I modified it so blkid depends on fstype
> 
> >You have an if() statement at the left edge, not indented at all  
> within
> >its function, and then the function ends with:
> >...
> >And the _reason_ that works is there's no curly bracket on the else  
> so
> >the write() belongs to the else but the putchar doesn't. Otherwise  
> the
> >function wouldn't end. Ouch.
> 
> I added {} to the fstype write, the putchar is shared, blkid needs it  
> too,
> but I can add it to the end of the TYPE="" and enclose the putchar in
> the fstype section
> 
> >The way to make an alias for a command is the OLDTOY() macro.
> >...
> >If you feed loopfiles() zero arguments, it reads from stdin. So  
> calling
> >blkid with no arguments hangs awaiting user input instead of printing
> >its usage message. (Probably you don't want NULL optstring, you want
> >"<1", at least for the moment.)
> 
> USE_FSTYPE(NEWTOY(grep, "<1" , TOYFLAG_BIN))
> USE_BLKID(OLDTOY(blkid,fstype, OPTSTR_fstype, TOYFLAG_BIN))
> 
> > Squashfs? Hello?
> 
> I'll test it against your aboriginal images.
> 
> >By the way, in terms of your 64k buffer (66k buffer, actually): no  
> sane
> >filesystem is going to have its identifying info straddle 4k blocks,  
> so
> >we should be able to read 4k chunks and iterate over the list for
> >offsets in range. (This even avoids lseek, although I'm not sure why
> >that would be an issue...)
> 
> NTFS label does but it is also has other "p\0r\0o\0b\0l\0e\0m\0s\0"
>  - Add it as a config option?

Not sure. In my first pass I just yanked it.

The problem is that treating "p\0r\0o\0b\0l\0e\0m\0s\0" as ascii is  
wrong because that won't handle japan/china/india/korea correctly. Most  
of the rest is naturally utf-8 clean if we just treat it as 8 bit, but  
this isn't. It needs an NLS translation layer with codepages (clearly  
beyond the scope of this project) and I have no clue what to do about  
that.

I guess displaying the ascii is better than nothing, but I'm not happy  
about it. Especially since this is the kind of thing where a GUI tool  
may call this to fetch the label and then try to display it, meaning my  
usual "internationalization is mostly the GUI's problem" approach may  
not apply here.

> >Right, continuing to clean this up until I can make it work. What the
> >HECK is this nest of MATCH macros calling each other for? (That's  
> where
> >the type punned pointer warnings come from, anyway...) Ah, it's only
> >used for ext2/3/4. Because treating ext2, ext3, and ext4 as three
> >separate filesystems just wouldn't do.
> 
> I had everything already in #defines and all of the ops were bit ops  
> or
> casts, so a function didn't make sense at the time.  After considering
> the different endianess and strict aliasing, I am thinking it may be
> simpler to: if (!memcmp(&toybuf, magic, sizeof(magic))) ... possibly
> using the SWAP_LE*(x) macros, (will have to test on qemu)

I sent you my first pass at it. I need to add a SWAP_LE macro, but  
paused on that when I realized that FAT and ext2 use different  
endiannesses for their uuid-like thingies (and different case), and I'm  
not sure how to signal that. (Ok, I can reserve bits out of the thing  
and add code to the display thing to check but special casing for one  
filesystem's display is uncomfortable. Then again, it's the filesystem  
of USB keys, so...)

> >You don't need to strcmp toys.which->name with "blkid", you can just
> >compare the first character to 'b'. (There are only two options...)
> 
> Oops, thats kind of embarassing.
> 
> >Alright, let's turn this giant stack of #defines and if/else  
> staircase
> >into a table with a loop iterating over it.
> 
> If it were just magic @ offsets, that would work, but then you have
> ext 2/3/4 that use the same magic at the same offset and then there's
> vfat/fat32/fat16/fat12/msdos that has at least 14 different possible
> magic values.

I moved the magic into the C code after the loop. I moved ext2 to the  
front of the list so it was trivial to check for it (!index) and moved  
ntfs after it so the _next_ one needing special cases was also at a  
known offset. I suppose vfat should be third...

> > Lets make the magic a uint64_t so we're not ignoring the second
> > half of the btrfs magic
> >you've got listed there, and let's just use the hex numbers like the
> >kernel does, ala:
> >
> >fs/btrfs/ctree.h:#define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii
> >_BHRfS_M, no null */
> 
> the superblock magic for btrfs in magic.h is at 64k IIRC, which brings
> up a point, should I just #include <linux/magic.h> or are we trying
> to stay platform neutral

I looked at that and it has the magic values but not the offsets at  
which to find them or the size of the field, so it doesn't really buy  
us anything. (If we need to supply magic values for each filesystem  
ourselves anyway, might as well do all of them and avoid the  
dependency.)

> I was splitting the fstype data apart from the blkid-only data so  
> uuid and
> label data would compile out if only fstype was configured.  So that  
> would
> leave us with:
> {
>   char * type;
>   unsigned char magic[4];
>   unsigned mag_off:12;
>   unsigned uuid_off:12;
>   unsigned lab_off:12;
>   unsigned lab_sz:6;
>   unsigned mag_sz:4;
>   unsigned uuid_t:2;
> }
> Thats only 14 bytes each, only 4 bytes is extra for blkid, but some  
> types
> would need multiple entries

I'm not a huge fan of bitfields. If you're parsing external data,  
controlling the endianness isn't something I've managed reliably. If  
you're not parsing external data, the size saved in the structure is  
more than offset by bigger code, and the compiler tends to produce  
worse code than just doing the masks and shifts yourself. (Maybe this  
has changed with different compiler versions?)

> If I were a bit more clever, I could probably pull off layering a  
> union of
> padded structs over toybuf like:
> union fs_t{
>   struct adfs_t adfs;
>   ...
>   struct zfs_t zfs;
> } *fs = (union fs_t *)&buf
> 
> and pad out each fs struct like:
> struct adfs {
>   unsigned char pad[0xc00];
>   unsigned short magic;
>   /* more padding and uuid, label if they exist */
> }
> 
> Am I incorrect in assuming that this would take 0 extra RAM?

And here we hit a big design difference between busybox and toybox: I  
prioritize simplicity over size and speed. Yes you could save some  
space by doing that, but it's not worth it.

I'm willing to have GLOBALS() populate a union of global structs  
because every command benefits, it can be hidden away in headers and  
the amount of space saved is large and cumulative, and you only have to  
learn about it once for the whole project. (Plus it allows tracking of  
leaks via scripts/findglobals.sh.) But individual commands pulling this  
kind of thing... maybe if you're saving a couple kilobytes that would  
otherwise go on the stack or in a long-lived program that's likely to  
run in a low-memory environment. But saving less than page size of  
global data from something that launches and then exits immediately?  
Probably not worth complicating the code. (Saving executable size is  
probably enough of an argument to not have every entry be "int" and  
sprinkle a few "short"s in there, although then you have to worry about  
padding...)

But in general: I'm going for simple and easily understood.

> this would simplify things to:
> read (fd, &toybuf, sizeof(struct fs_t));
> if (fs->adfs.magic==ADFS_SUPER_MAGIC){

Space before the curly bracket please. (I need to update the coding  
style.)

>   fstype="adfs";
>   if(CFG_BLKID) //...
> }else if ... //similar for most of the rest
> lseek, read repeat

The version I did doesn't lseek, so if you do "zcat blah.img.gz | blkid  
-" it should work. No idea if this is a real world use case, of course,  
but it wasn't hard to do.

> I already have most of this work done, but discarded the idea because
> figuring out the padding for structs in 65k was a bit tedious, 4k is  
> much
> more manageable.
> 
> >Hmmm, you have a CRAMFS_MAGIC2 but your code doesn't seem to be using
> >it. (The if is using a MATCH() macro instead of MATCH2().) Ah, the
> >kernel header says that's the same number at the other endianness.
> 
> I actually meant to use MATCH2 for that.
> 
> >If JFS isn't even in /usr/linux/include/magic.h is it really an
> >important filesystem to autodetect?
> 
> My process at 5AM is pretty random, I was actually meaning to find  
> jffs,
> but by the time I realized they weren't the same thing it was already
> done.

I get a bit punchy at the end of long sessions too. (Next checkin,  
remind me to replace "murderfs" with the official name for the thing so  
I don't offend both remaining users.)

> >For NTFS, you have 8 as the label length (well, -8) but toutf8 fills
> >out a 16 byte buffer? (And it doesn't actually have a length, it just
> >keeps going until it hits a null terminator which there's no  
> guarantee
> >the file will have...)
> 
> $#@7 all the '\0's after adapting a simple strcpy will add  i<16 &&

I didn't follow that sentence at all. A strcpy will copy past the first  
null terminator?

> >Also, the NTFS label isn't _really_ alternating ascii and NUL bytes.
> >It's horrible 16 bit wide character stuff that involves "codepages"  
> and
> >actually displaying labels from japan or korea just isn't going to  
> work
> >here. (Doing full windows internationalization isn't an option  
> either.
> >The question is, does the special case for ascii make sense or should
> >we just not support labels here at all? I'm balancing "2/3 of the
> >planet does not speak english" with "does android care about legacy
> >windows crap that's this generation's version of punched cards?" Eh,  
> I
> >guess "windows was english only, the future is UTF8" is a reasonable
> >compromise...)
> >
> >However, add to that the fact that ntfs is the only filesystem that  
> has
> >a label in a different 4k block than the ID info, and special casing
> >this really sounds like more trouble than it's worth. Are there a lot
> >of thumb drives formatted NTFS out in the wild? (I'll add code to  
> deal
> >with a real world problem, my question is whether this is a real  
> world
> >problem? No idea.)
> >
> >Also... ntfs has an 8 bit uuid? What? (It's the only one that  
> does...)
> 
> I don't know if wprintf would work for that or not, never really used  
> it.

A wide character on linux is 32 bits, but... maybe? It punts to libc  
and LC_CTYPE and the following paragraph from the man page isn't  
promising:

   If  the  format  string contains non-ASCII wide characters, the  
program
   will only work correctly if the LC_CTYPE category of the current  
locale
   at  run time is the same as the LC_CTYPE category of the current  
locale
   at compile time.  This is because the wchar_t representation  is   
plat‐
   form-  and  locale-dependent.   (The  glibc  represents wide  
characters
   using their Unicode (ISO-10646) code point, but other  platforms   
don't
   do  this.   Also,  the use of C99 universal character names of the  
form
   \unnnn does not solve this problem.)  Therefore,  in   
internationalized
   programs,  the  format  string  should consist of ASCII wide  
characters
   only, or should be constructed at run time in an internationalized   
way
   (e.g., using gettext(3) or iconv(3), followed by mbstowcs(3)).

> Since I am nixing the buf for toybuf and ntfs spans over 4k, I think  
> I'll
> leave out the ntfs label chunks in the next revision and plan to add  
> it
> as a config option.  I think there are 1 or 2 others that use 8 byte  
> uuids.
> 
> >Hang on, this thing doesn't identify vfat? (Which most external USB
> >devices are formatted with?) Hmmm, I know microsoft's documentation
> >says not to use the "FAT16" and "FAT32" strings for filesystem
> >identification, but I don't care.
> 
> If you are good with that, I am.  That's what I have as a stub, but
> there are a ton of other checks that seem superfluous and haven't
> had the time/resources to track down.  All of my thumb drives seem
> to work with that.

I did a mkfs.msdos on a "truncate -s 1m blah.img" and looked at the  
output and it made a FAT12. (Sigh.) So I had "FAT32" and "FAT1" checks  
(because they're at different offsets and the second catches fat12 and  
16.)

> >Ok, printing out the uuid there's three different possible  
> bit-patterns
> >for where the "-" go, one for 16 (the default), one for 4 (vfat), and
> >one for 8 (ntfs, no dashes). I think rather than having a separate  
> uuid
> >length field that's usually 16 I'll encode the non-16 values in the  
> top
> >few bits of the offset, since I've got an int. (Offset already won't
> >fit in a short.)
> 
> I get tripped up with bswap*, but all those %X values could probably  
> be
> reduced to 4 (or 5 I guess, unless there is a 6 byte type)

I did a for loop and used an int as a bitfield of where to put the -  
signs.

In lib/portability.h there's SWAP_LE64() that converts to little endian  
64 bit. (If you're on little endian, it's NOP.) There's SWAP_BE64()  
too, plus 32 and 16 versions.

> >Hmmm, in testing FAT's uuid bytes are presented in reverse order from
> >the tool ubuntu's using. But ext2 isn't...
> 
> >Need test images. Lots and lots of test images...
> 
> I will attach a 30kb tarball with the ones I have in a separate post.

Yay!

I'll probably put them in the test suite.

> >> blkid does output for all devices if 0 args -> read  
> /proc/partitions?
> >
> >Possibly. (You can run the other one under strace to see what it's
> >doing.)
> 
> strace says it is looking in /dev and running getdents64... I am  
> guessing
> that even embedded is using devtmpfs these days,

The current udev versions are built on devtmpfs and no longer do node  
creation/deletion.

> but I know puppy
> linux (some versions) still has a static /dev, so blkid takes a  
> while.  My
> gut says /proc/partitions is the better way but then again procfs is a
> configurable option.  blkid /dev/sd* (or whatever) should work for  
> them.

I have no objection to using /proc/partitions.

Rob
 1381352771.0