[Toybox] switch_root: fix switch_root's interaction with mount namespaces (PR #593)

Thu Jan 8 15:25:44 PST 2026

On 1/8/26 11:37, Dima Zavin wrote:
> Commit c04b565204eb6b7e3508ac8dd42539ab97752635 reworked how
> switch_root moves mounts into the new root, but it
> inadvertently removed the moving of the root itself onto / for the mount
> namespace before chrooting.
> 
> This confuses future users of the mount namespaces since root mount gets
> preserved and thus entering any derived mount namespace retains the pre-chroot
> structure.

Sigh, switch_root is one of the commands I need to get scripts/test.sh 
to run under mkroot to automatically regression test.

> Found in yocto (scarthgap, toybox 0.8.11) where the mount namespaces
> contained just /rootfs in their /. Repro is simple:
> 
> Before:
> 
> ```
> % sudo nsenter -m -t $$
> nsenter: failed to execute /bin/sh: No such file or directory

Huh, does nsenter -m effectively do a chdir / ? Does it _always_ break 
out of a normal chroot?

   $ cd toybox/root/x86_64
   $ sudo chroot fs
   password:
   $ mount -t proc proc /proc
   $ nsenter -m -t $$ /bin/sh
   # ls
   # head -n 1 /etc/os-release
   PRETTY_NAME="Devuan GNU/Linux 5 (daedalus)"

Apparently so. Good to know, I guess. (Dear lkml: what the? I know you 
refused to patch the cd ../../../.. hole but this is just silly.)

> % sudo nsenter -m -t $$ /rootfs/usr/lib64/ld-linux-x86-64.so.2 \
>     --library-path /rootfs/lib:/rootfs/lib64:/rootfs/usr/lib64:/rootfs/usr/lib \
>    /rootfs/usr/sbin/chroot.coreutils /rootfs

You manually ran the dynamic linker against chroot.coreutils, to chroot 
into /rootfs, within which I'm assuming it ran /bin/sh. Not sure what 
that proved, you just chrooted _back_ without the mount --move a second 
time.

> #
> ```
> 
> After:
> ```
> % sudo nsenter -m -t $$
> #
> ```
> 
> Fixes #557
> 
> Signed-off-by: Dima Zavin <dmitriyz at waymo.com>
> ---
>  toys/other/switch_root.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/toys/other/switch_root.c b/toys/other/switch_root.c
> index 1c750608f..b63b92ec3 100644
> --- a/toys/other/switch_root.c
> +++ b/toys/other/switch_root.c
> @@ -97,6 +97,12 @@ void switch_root_main(void)
>    // Ok, enough safety checks: wipe root partition.
>    dirtree_read("/", del_node);
>  
> +  // Fix the appearance of the mount table in the newroot chroot
> +  if (mount(".", "/", NULL, MS_MOVE, NULL)) {
> +    perror_msg("mount");
> +    goto panic;
> +  }
> +
>    // Enter the new root before starting init
>    if (chroot(".")) {
>      perror_msg("chroot");

In theory the dirtree_read("/") is supposed to operate on "/" as well as 
the children. In practice there's a sequencing issue with mounts being 
under other mounts (which this is a trivial case of). If you have two 
mount points arranged dir1/dir2, you need to move dir2 to /tmp, move 
dir1 to new/dir1, and then move /tmp to new/dir1/dir2. (There's no 
MS_MOVE_ALL flag I'm aware of.)
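
A rough sketch of that three-step shuffle, for illustration only. The 
helper name move_nested() and its parameters are made up here (this is 
not code from switch_root.c), and it assumes every path involved is an 
existing directory and that the child really does have to be parked 
separately as described above:

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

// Hypothetical helper, not toybox code: relocate the mount at `parent` to
// `dest`, carrying the mount nested at `parent/child_name` along by parking
// it on `scratch` first. Returns 0 on success, -1 on the first failed step
// (errno is left as set by mount(), or ENAMETOOLONG for an oversized path).
int move_nested(const char *parent, const char *child_name,
                const char *scratch, const char *dest)
{
  char child[4096], dest_child[4096];
  int len;

  len = snprintf(child, sizeof(child), "%s/%s", parent, child_name);
  if (len < 0 || (size_t)len >= sizeof(child)) {
    errno = ENAMETOOLONG;
    return -1;
  }
  len = snprintf(dest_child, sizeof(dest_child), "%s/%s", dest, child_name);
  if (len < 0 || (size_t)len >= sizeof(dest_child)) {
    errno = ENAMETOOLONG;
    return -1;
  }

  // Step 1: park the child mount out of the way.
  if (mount(child, scratch, NULL, MS_MOVE, NULL)) return -1;

  // Step 2: move the parent mount to its new location.
  // (On failure the child stays parked on scratch; a real tool would
  // have to decide whether to move it back.)
  if (mount(parent, dest, NULL, MS_MOVE, NULL)) return -1;

  // Step 3: put the child back underneath the relocated parent.
  if (mount(scratch, dest_child, NULL, MS_MOVE, NULL)) return -1;

  return 0;
}
```

For the example in the text that would be called as 
move_nested("/dir1", "dir2", "/tmp", "/new/dir1"). It needs one scratch 
mount point per level of nesting.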

The easy fix for the current case is to DIRTREE_COMEAGAIN and handle all 
the moves in the second callback, that way all children are handled 
before their parents. (This avoids adding a second explicit mount() call 
when the first mount() call can theoretically already handle it. Single 
Point of Truth and all that...)
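
To make the ordering concrete, here is a standalone sketch of 
"children before parents" traversal. It deliberately does not use the 
real dirtree API: struct node and visit_post_order() are hypothetical 
stand-ins, and the "move" is just recording the name so the order is 
visible:

```c
#include <stddef.h>

// Hypothetical stand-in for a directory tree node (not toybox's struct dirtree).
struct node {
  const char *name;
  struct node *child;  // first child, or NULL
  struct node *next;   // next sibling, or NULL
};

// Visit every child subtree first, then the node itself, appending each
// name to out[]. This is the point where a "second callback" would act,
// so a parent is only handled after everything beneath it.
// Returns the updated count; entries beyond max are silently dropped.
size_t visit_post_order(const struct node *n, const char **out,
                        size_t count, size_t max)
{
  const struct node *c;

  for (c = n->child; c; c = c->next)
    count = visit_post_order(c, out, count, max);
  if (count < max) out[count++] = n->name;

  return count;
}
```

With a tree of / containing /dev (which contains /dev/pts) and /proc, 
this records /dev/pts, /dev, /proc, and then / last.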

This doesn't solve the larger problem (ala /dev being a devtmpfs and 
/dev/pts being a devpts), but might address _this_ issue without adding 
significant code.

Do I _want_ to try to fix the larger issue? I'd need an arbitrary number 
of mountpoints to hold arbitrarily deep trees while moving them, and I'm 
not guaranteed to have any writeable space to mkdir in. That's why I 
didn't try to tackle it before. In theory "switch_root before doing your 
setup" has been the order of the day... in which case you don't need to 
care about any child mounts, you just want to swap two mounts the way 
pivot_root does.

Would the simpler non-recursive version break anybody? I have no idea. 
You'd want to move /dev if CONFIG_DEVTMPFS_MOUNT worked, but the 
kernel guys have refused 
https://landley.net/bin/mkroot/0.8.13/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch 
and friends for NINE YEARS now. (Which is why the stupid "static 
initramfs has no stdin/stdout/stderr when it launches PID 1" bug keeps 
cropping back up, because the kernel has inconsistent behavior in 
different codepaths...)

Are there existing users who would be broken by doing less, or is 
everybody just calling switch_root as the first thing and then having 
the "real" init script live in the new filesystem?

Rob