[Toybox] Notes about mount: VFS flags.

Sun Aug 12 21:23:51 PDT 2012

The mount system call takes five arguments:

int mount(const char *source, const char *target,
          const char *filesystemtype, unsigned long mountflags,
          const void *data);

The first is "what to mount", the second "where to mount it", and the
third is "filesystem type".  The simplest variant of this is a block
device, a directory, and a string like "ext2" or "9p". For other
variants, see http://landley.livejournal.com/52326.html

The last two are options, a la "mount -o flag,flag,flag". They're split
into two parts: a bitfield of things the VFS layer should know about
(argument 4), and a string of the remaining comma-separated options
(argument 5).

The mount command filters out the VFS options it recognizes, uses them
to set the bitfield, and passes the remaining options to the kernel. (If
mount doesn't filter them out, most filesystem drivers will barf on the
unknown option strings.)
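
Here's a minimal sketch of that filtering step. The helper name
split_opts() and the short table are made up for illustration (a real
mount command knows many more strings, plus negated forms like "rw" or
"suid"); the MS_ constants come from <sys/mount.h>.

```c
#include <string.h>
#include <sys/mount.h>

/* Hypothetical table: a few of the VFS option strings mount recognizes. */
static const struct { const char *name; unsigned long bit; } vfs_opts[] = {
    {"ro", MS_RDONLY}, {"nosuid", MS_NOSUID}, {"nodev", MS_NODEV},
    {"noexec", MS_NOEXEC}, {"sync", MS_SYNCHRONOUS}, {"noatime", MS_NOATIME},
};

/*
 * Split a comma-separated -o string. Recognized VFS options are ORed into
 * the return value (argument 4 of mount); everything else is copied, still
 * comma-separated, into data (argument 5). data needs strlen(opts)+1 bytes.
 */
unsigned long split_opts(const char *opts, char *data)
{
    unsigned long flags = 0;
    char *out = data;

    *out = 0;
    while (*opts) {
        size_t len = strcspn(opts, ",");
        size_t i, n = sizeof(vfs_opts) / sizeof(*vfs_opts);

        /* Look for an exact match of this option in the table. */
        for (i = 0; i < n; i++)
            if (strlen(vfs_opts[i].name) == len
                && !strncmp(opts, vfs_opts[i].name, len)) break;

        if (i < n) flags |= vfs_opts[i].bit;
        else if (len) {
            /* Unrecognized: leave it for the filesystem driver. */
            if (out != data) *out++ = ',';
            memcpy(out, opts, len);
            out += len;
            *out = 0;
        }
        opts += len;
        if (*opts == ',') opts++;
    }

    return flags;
}
```

With "ro,errors=continue,nosuid" this should return MS_RDONLY|MS_NOSUID
and leave "errors=continue" in data, which are the values you'd then hand
to mount() as arguments 4 and 5.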

On 32 bit platforms "unsigned long" gives you 32 bits, so I dunno why
they didn't just define this field as "unsigned int" unless they wanted
flags that could only apply to 64 bit systems? (At the moment, there
aren't any.)

--- Values

Here are all the flags, in bit order, named by the string the -o option
uses to trigger them (and the kernel shows in /proc/mounts).

char *vfsflag[] = {
    "ro", "nosuid", "nodev", "noexec", "sync", "remount", "mand",
    "dirsync", "", "", "noatime", "nodiratime", "bind", "move", "rec",
    "silent", "posixacl", "unbindable", "private", "slave", "shared",
    "relatime", "kernmount", "iversion", "strictatime", "", "", "",
    "nosec", "born", "active", "nouser"
};

Each vfsflag[i] has the value 1<<i. The "" entries are gaps that haven't
got a flag defined there yet. Still a little space for future expansion.
Most of these have an MS_CONSTANT #defined in some header file, but the
name of the constant and the name of the string /proc/mounts shows for
that vfs flag aren't always the same, and /proc/mounts wins.
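
To make the 1<<i mapping concrete, here's a small sketch that converts in
both directions. The function names flag_value() and flags_to_string()
are invented for this example; the table is the one above.

```c
#include <stdio.h>
#include <string.h>

static const char *vfsflag[32] = {
    "ro", "nosuid", "nodev", "noexec", "sync", "remount", "mand",
    "dirsync", "", "", "noatime", "nodiratime", "bind", "move", "rec",
    "silent", "posixacl", "unbindable", "private", "slave", "shared",
    "relatime", "kernmount", "iversion", "strictatime", "", "", "",
    "nosec", "born", "active", "nouser"
};

/* Return the bit for an option name, or 0 if it isn't a VFS flag. */
unsigned long flag_value(const char *name)
{
    int i;

    for (i = 0; i < 32; i++)
        if (*vfsflag[i] && !strcmp(name, vfsflag[i])) return 1UL << i;

    return 0;
}

/*
 * Write the names of all set bits into buf, comma separated, skipping the
 * gaps. Stops early if buf fills up (output may be truncated). Returns buf.
 */
char *flags_to_string(unsigned long flags, char *buf, size_t size)
{
    size_t used = 0;
    int i;

    if (size) *buf = 0;
    for (i = 0; i < 32; i++) {
        int n;

        if (!(flags & (1UL << i)) || !*vfsflag[i]) continue;
        n = snprintf(buf + used, size - used, "%s%s",
                     used ? "," : "", vfsflag[i]);
        if (n < 0 || (size_t)n >= size - used) break;
        used += n;
    }

    return buf;
}
```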

--- NOPS:

The flags "relatime", "active", "born", and "kernmount" are all filtered
out by the kernel in do_mount(). The flags "posixacl", "nosec", and
"nouser" are apparently INTENDED by the kernel to be NOPS (filesystems
set them internally to the appropriate values), but the kernel doesn't
filter them out and you can presumably break stuff.

(A mount with nouser will always fail. Don't ask me what happens if you
pass MS_POSIXACL to mount when mounting jffs2 with access control list
support disabled in the filesystem but enabled in the VFS... Yes, the
relatime flag is a NOP because it's switched on by default and setting
noatime switches it off, so the flag itself means nothing. I'm not sure
what the subtle difference between "born" and "active" is.)

Here's what the NOP flags mean:

"relatime" - only update atime when it's older than mtime or ctime, or
more than a day old: this is the default behavior so the flag is useless.

"born" - "is mount finished". Flag is set in fs/super.c:mount_fs(), and
never cleared. It means filesystem has successfully been mounted
somewhere in the tree, and has thus finished initialization.

"active" - "has umount started". Set in several different mount
codepaths, cleared by umount's fs/super.c:generic_shutdown_super(). When
this flag is clear the mount's disk cache gets freed (open files go away
as soon as each inode's reference count reaches zero).

"kernmount" - Kernel internal mount, set in fs/super.c:kern_mount_data()
and used in procfs and the posix message queue filesystem to indicate...
something. (Presumably that if you unmount it, the superblock shouldn't
go away.)

"posixacl" - Support OS/2 extended attributes, I mean the macintosh
"resource fork", I mean violate unix's basic idea of having everything
be a file and stick extra data in a bag on the side of each file. The
security guys wanted this for approximately the same reason the TSA
doesn't feel safe without making you take off your shoes and go through
a porno-scanner: they can't solve the problem with the resources they've
already got, and think more resources will somehow help. We copied the
term "access control lists" from Windows NT to give you an idea how good
an idea it is. Anyway, this flag is set by the filesystem driver to tell
the VFS what to do, supplying it from userspace really can't end well.

"nosec" - set in fs/super.c:mount_bdev() on first mount, means the
filesystem hasn't had any recent security context changes (permissions
added/removed in a way that suid would need to examine closely) so it
can fast-path some stuff.

"nouser" - this filesystem cannot be mounted from userspace. (You can
set this bit on any attempt to mount, guaranteeing it will fail! Note:
not the same as fstab's user/nouser, there "nouser" means only root can
mount it. This flag means even root can't mount it. For kernel internal
things like sockfs and pipefs where you get a filehandle but it's not to
anything mounted on the filesystem tree.)
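
Given all that, a mount command might reasonably refuse to pass any of
the NOP bits through. This is only a sketch: the helper name and the F_
macros are hypothetical, and the bit positions are taken from the
vfsflag[] table above rather than from header constants (not every libc
header necessarily exports names for all of them).

```c
/* Bit positions taken from the vfsflag[] table above. */
#define F_POSIXACL  (1UL << 16)
#define F_RELATIME  (1UL << 21)
#define F_KERNMOUNT (1UL << 22)
#define F_NOSEC     (1UL << 28)
#define F_BORN      (1UL << 29)
#define F_ACTIVE    (1UL << 30)
#define F_NOUSER    (1UL << 31)

/* Filtered out by the kernel anyway, so passing them changes nothing. */
#define F_IGNORED   (F_RELATIME | F_KERNMOUNT | F_BORN | F_ACTIVE)

/* Not filtered, but meant to be set by filesystem drivers, not userspace. */
#define F_INTERNAL  (F_POSIXACL | F_NOSEC | F_NOUSER)

/* Hypothetical helper: drop every NOP flag before calling mount(). */
unsigned long strip_nop_flags(unsigned long flags)
{
    return flags & ~(F_IGNORED | F_INTERNAL);
}
```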

--- Superblock flags

These flags affect the entire filesystem no matter how many times it's
mounted:

"sync" - writes block until the data gets flushed to backing store.

"dirsync" - sync for metadata only

"mand" - allow mandatory locking (mostly an NFS thing I think?)

--- per-mountpoint flags

These flags can be different for each place a filesystem is mounted:

"ro" - everything under this mountpoint is read only.

"noatime" - don't update atime at all.

"nodiratime" - don't update atime on directories

"strictatime" - update atime every time a file is accessed

"nosuid" - Ignore the SUID bit (and sgid bit)

"noexec" - Ignore the exec bit

"nodev" - Don't allow device nodes to work.

--- Codepath selection

These flags cause mount to do something other than mount a new
filesystem on a mount point. You can only supply one of these flags to a
given mount system call (with the exception of "rec", which needs one of
the other flags).

"rec" - apply this change recursively to all sub-mounts under this
directory.

"remount" - change VFS flags on an existing mount (you can change "ro",
"sync", "mand", and "iversion").

"move" - relocate an existing mount point to a new location.

"bind" - mount an existing file or directory on a new file/directory.

Bind mounts are a bit like a symlink, only at the mount level. A bind
mount doesn't have to be a whole filesystem, it can splice an arbitrary
directory (or file) to a new location. Yes, you can mount a file on a
file, and even mark it read only while doing so.
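
A minimal sketch of the basic bind call, assuming a made-up wrapper name
bind_mount(). It only does the plain bind (no read-only step), and it
needs root to actually succeed.

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Bind src (a file or a directory) onto dst, which must already exist and
 * be the same kind of thing. Returns 0 on success, -1 with errno set.
 * The filesystem type and data arguments are ignored for a bind, so NULL
 * is passed for both.
 */
int bind_mount(const char *src, const char *dst)
{
    return mount(src, dst, NULL, MS_BIND, NULL);
}
```

For example, bind_mount("/tmp/hosts.test", "/etc/hosts") would splice one
file over another (the first path is hypothetical), and umount on the
target undoes it.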

The rest of the codepath selection is the "shared subtree" flags,
described in this lwn.net article:

  https://lwn.net/Articles/159077/

Different process groups can have different mount trees (clone with
CLONE_NEWNS creates a new mount tree, initially a duplicate of the
parent's but changes can stay local.)  These flags control which mounts
show up in which namespaces:

  "shared" - see the lwn.net article

  "private" - see the lwn.net article

  "slave" - see the lwn.net article

  "unbindable" - see the lwn.net article

(Unbindable also sprays a mount point in teflon so no new mounts stick
to it, you can't even loopback mount into it. And yes, this can apply to
bind mounts.)
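
The "only one codepath flag per call" rule from the start of this section
can be sketched as a sanity check a mount command might run before the
syscall. The name codepath_ok() is hypothetical, and this encodes the
rule as stated in this note, which isn't necessarily the exact set of
checks the kernel performs.

```c
#include <sys/mount.h>

#define CODEPATH_FLAGS (MS_REMOUNT | MS_MOVE | MS_BIND | MS_SHARED \
                        | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE)

/*
 * Hypothetical check: at most one codepath flag per call, and "rec" only
 * alongside one of them. Returns 1 if the combination passes, 0 if not.
 */
int codepath_ok(unsigned long flags)
{
    unsigned long sel = flags & CODEPATH_FLAGS;

    if (sel & (sel - 1)) return 0;           /* more than one bit set */
    if ((flags & MS_REC) && !sel) return 0;  /* "rec" with nothing to modify */

    return 1;
}
```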

--- Everything else

I need to look up whether these two are per-mountpoint or per-superblock:

  "silent" - ask filesystem driver not to log stuff into dmesg or the
console. (Used to be called "verbose", but that's not what it did.)

  "iversion" - increment a version counter in the inodes each time they
change. (Some filesystems record this, some don't.)

(Note: even though this is a VFS flag, filesystems like ext4 look for
"i_version" and set this, but mount looks for "iversion".)

Rob
-- 
GNU/Linux isn't: Linux=GPLv2, GNU=GPLv3+, they can't share code.
Either it's "mere aggregation", or a license violation.  Pick one.