[Aboriginal] uClibc-0.9.33.2 statfs() does not populate the `f_frsize' field of `struct statfs'

Fri Dec 28 16:45:11 PST 2012

On 12/27/2012 10:52:13 PM, Rajeev V. Pillai wrote:
> > Rob Landley <rob at landley.net> on Friday, December 28, 2012 9:37 AM wrote:
> >
> > Are there any that implement it? Linus said he wanted to see
> > fragments go away in 1995:
> 
> I'm pretty certain that Novell NetWare 4.x natively does it for its
> filesystem. Not sure if Linux's NWFS implementation handles those
> fragments, though. Also not sure if FSes which do block suballocation
> (Btrfs, Reiser4, UFS2) handle the sub-blocks as fragments and report
> them as such in `f_frsize'. It seems logical that they should.

I know that reiserfs didn't. And I'm pretty sure btrfs doesn't.
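
If you want to see what a given filesystem actually reports, something
like the following should do it. This is only a minimal sketch: it goes
through POSIX statvfs() rather than statfs(), and report_sizes() is a
made-up helper name, not anything from uClibc.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical helper: print the block size and fragment size that the
 * filesystem holding `path` reports.
 * Returns 0 on success, -1 if statvfs() fails. */
int report_sizes(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0) {
        perror("statvfs");
        return -1;
    }

    printf("%s: f_bsize=%lu f_frsize=%lu\n", path,
           (unsigned long)sv.f_bsize, (unsigned long)sv.f_frsize);
    return 0;
}
```

Call it as report_sizes("/") (or any mount point) and compare the two
numbers.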

The thing is, things like tail packing perform variable sized  
sub-allocations, so reporting a single "fragment size" number for the  
whole filesystem is meaningless in that context. (And really, tail  
packing is a special case.)

That fragment field dates back to attempting to have a block size  
smaller than actual physical transactions were done in, and Linus  
basically pointed out that the smaller value is the real block size and  
the larger value is totally artificial so attempting to maintain  
multiple levels is sad. (The block layer can sort and merge outstanding  
requests, that's the "I/O elevator" code. Trying to do this is not a  
filesystem's job.)

I.E. _WHEN_ you do block suballocation, the granularity at which you do
so is bytes. So the fragment size would always be "1", which is useless.

I.E. this really did get discarded 17 years ago and nobody's  
resurrected it since, because it was a bad idea.

> And, given the push towards larger block sizes, more FSes will start
> to implement something like fragments.

No, they won't. You're acting like this is a new thing instead of a  
topic of discussion for many years now:

   https://lwn.net/Articles/250335/
   https://lwn.net/Articles/349970/

Again, because a _single_ fragment size is nonsensical, what you want
are variable-sized chunks. And what you can do to get them is demand
that ranges of 4096-byte blocks be contiguous and then store a count of
how many of them you've used, which is an optimization ext2 has been
using from day 1 and BSD used before that.
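
To make "contiguous range plus a count" concrete, here's a rough sketch.
The struct and function names are hypothetical and this is not any real
filesystem's on-disk layout; the 4096-byte block size is assumed.

```c
#include <stdint.h>

#define FS_BLOCK_SIZE 4096u   /* assumed filesystem block size in bytes */

/* Hypothetical record: a contiguous run of blocks, stored as the first
 * block number plus how many blocks follow it. */
struct extent {
    uint64_t start_block;
    uint32_t block_count;
};

/* Number of bytes a run of blocks covers. */
uint64_t extent_bytes(const struct extent *e)
{
    return (uint64_t)e->block_count * FS_BLOCK_SIZE;
}

/* Number of whole blocks needed to hold len bytes, rounding up. */
uint32_t blocks_for(uint64_t len)
{
    return (uint32_t)((len + FS_BLOCK_SIZE - 1) / FS_BLOCK_SIZE);
}
```

The point is that allocation is still tracked in units of one block;
the run length is just a count, not a second "fragment size".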

If your argument is "we must be able to subdivide filesystem blocks",  
people do so via byte ranges. (They jump from granularity 4096 to  
granularity 1.) When it's "we must use larger transactions than  
filesystem blocks", people group blocks but continue to track the  
allocation at either block size or byte size.

Some media naturally use larger transaction sizes than the filesystem  
block size, but the fix for that is to make the journaling layer aware  
of it so it can group the commits. This isn't just a block size issue,  
it's an alignment issue. When disks started increasing block sizes to  
_match_ the block sizes filesystems had been using for years, there was  
a problem that the 512 byte "boot sector" put things out of alignment,  
and we had to update partitioning programs to create partitions that  
started at the right offset:

   https://lwn.net/Articles/322777/
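
The alignment check itself is just arithmetic. A minimal sketch,
assuming partition start offsets are given in 512-byte sectors;
partition_is_aligned() is a hypothetical name, not a function from any
partitioning tool:

```c
#include <stdint.h>

#define SECTOR_SIZE 512u   /* unit in which start_lba is counted */

/* Returns 1 if a partition starting at start_lba (counted in 512-byte
 * sectors) begins on a multiple of phys_block bytes, 0 otherwise. */
int partition_is_aligned(uint64_t start_lba, uint32_t phys_block)
{
    if (phys_block == 0)
        return 0;   /* no meaningful block size to align to */
    return (start_lba * SECTOR_SIZE) % phys_block == 0;
}
```

With 4096-byte physical blocks, the traditional start at sector 63
fails this check (63 * 512 = 32256, which is not a multiple of 4096),
while a start at sector 2048 passes.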

Note that making filesystem block sizes larger than the memory page  
size didn't happen. And even though we've got _terabytes_ of memory on  
some of the larger systems, the default RAM allocation size is staying  
4k.

Yes, there are hugepages to conserve TLB entries, and you can format an
ext4 filesystem with huge blocks so it doesn't spend forever parsing  
allocation tables:

   https://lwn.net/Articles/469821/

But note that when they do that, they don't sub-allocate within  
hugepages or huge blocks, because doing so DEFEATS THE PURPOSE OF  
HAVING THEM. This isn't a "fragment size" because you don't fragment  
them. Subdividing them is the application's problem, not the OS's.

Rob