[Toybox] [lxc-devel] Device Namespaces

Sun Oct 13 16:01:07 PDT 2013

Snippet of a conversation on linux-kernel relevant to implementing
container support in toybox. Since Greg KH refuses to namespace
devtmpfs (so each container can mount its own and see just the
container's devices), this suggestion is to make a /dev/container
directory within which you create a subdirectory for each container to
bind mount on /dev, and then the host tool can manage the devices for
the container.

This is part of the reason I've been holding off on an mdev rewrite:  
still not sure what exactly it should _do_.

On 09/30/2013 10:36:50 AM, Michael H. Warfield wrote:
> On Sun, 2013-09-29 at 13:06 -0700, Greg Kroah-Hartman wrote:
> > On Sun, Sep 29, 2013 at 10:28:55PM +0300, Amir Goldstein wrote:
[snip]
> You're right about the user space problem.  Something needs to manage
> the devices in a coherent manner as devices come and go and as
> containers come and go in an asynchronous manner.  In my mind, the only
> place for that is in the host.  "Non trivial" is a jaw dropping
> understatement and I can see where you feel it would be impossible to
> manage in applying namespaces to devtmpfs.  That leaves the user space
> in the host.  I can see where it would be intractable in the kernel.
> 
> I may get beat mercilessly for suggesting this but, just as with
> cgroups, if we create a subdirectory in devtmpfs for each subsystem
> (LXC) and container, we can then bind mount that subtree off of
> devtmpfs to the container and then the host can map and manipulate the
> device subtree into the container (even if the container is denied
> mknod capability).  That leaves the host to manage all the devices,
> which actually makes a LOT of sense (to me) since it should be
> responsible for the devices and the overall kernel operations.  That
> would be no different than needing to configure device passthroughs
> for KVM / VirtualBox / VMware hypervisors.
> 
> Example...  In the host I would have something like this...
> 
> /dev/lxc/
>     romulus
>     remus
>     gemini
>     janus
> 
> And then bind mount each of those subdirectories to the
> /var/lib/lxc/${Container}/rootfs/dev directory.  Then map the devices
> from the host /dev to the container /dev with mknod in the host and
> relative symlinks.
> 
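
A minimal Python sketch of the host-side step described above, assuming
the /dev/lxc/<container> layout from the example. The helper names
(node_args, relative_link, map_device) and the default 0660 permission
are hypothetical, not anything LXC ships. Creating nodes needs
CAP_MKNOD, which is why this runs in the host. The symlinks are relative
because the same subtree appears as /dev/lxc/<container> in the host and
as /dev in the container, and only a relative target resolves in both
views.

```python
import os
import stat

# Hypothetical layout taken from the example above: /dev/lxc/<container>/
HOST_SUBTREE = "/dev/lxc"


def node_args(kind, major, minor, perm=0o660):
    """Return the (mode, device) pair os.mknod() expects for a 'c' or 'b' node."""
    type_bits = {"c": stat.S_IFCHR, "b": stat.S_IFBLK}[kind]
    return type_bits | perm, os.makedev(major, minor)


def relative_link(link_path, target_path):
    """Compute a symlink target relative to the directory holding the link."""
    return os.path.relpath(target_path, start=os.path.dirname(link_path))


def map_device(container, name, kind, major, minor, aliases=()):
    """Create one node plus its alias symlinks under the container's subtree.

    Must run in the host with CAP_MKNOD; the container never calls mknod.
    """
    root = os.path.join(HOST_SUBTREE, container)
    node = os.path.join(root, name)
    os.makedirs(os.path.dirname(node), exist_ok=True)
    mode, dev = node_args(kind, major, minor)
    os.mknod(node, mode, dev)
    for alias in aliases:
        link = os.path.join(root, alias)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        os.symlink(relative_link(link, node), link)
```

For example, map_device("remus", "sda1", "b", 8, 1,
aliases=["disk/by-label/root"]) would create /dev/lxc/remus/sda1 and a
link whose target is ../../sda1.
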
> That also (I think) helps me deal with some of the (mis)behavior of
> systemd where it contains unconfigurable behavior (mounting devtmpfs)
> controlled by "magic cookies" (/dev mounted on another major/minor
> from / to disable it mounting devtmpfs).  I initially recoiled in
> horror at the thought of overloading the devtmpfs subtree with
> container based subdirectories, devices, and symlinks but the idea grew
> on me that this might be better than what we're dealing with now of
> mounting tmpfs on the /dev mount point in all these containers and then
> having to populate them just to prevent systemd from creating
> collisions with devtmpfs and the resulting violation of the container
> isolation.
> 
> It DOES still leave the problem of dealing with udev rules in the
> container and subsidiary device symlinks in the container which may
> not correspond to the rules in the host.  That's still a problem in my
> mind (but already present and minuscule compared to what we would be
> solving).  I could pattern match everything coming out of udev in a
> trigger and map devices and symlinks into the new subtree in the host
> but I have no way to manage propagating the rules in the container
> down into the processor in the host or a way to trigger those udev
> rules in the containers.  Suggestions there might be nice (as well as
> the catcalls).  I'm not sure I have it clear in my head yet how I
> would deal with bringing up a container and then mapping all the
> required existing devices over to it.  That's your user space problem
> in a nutshell.  That's easy to handle with udev as things come and go
> but, when the user space comes after and udev isn't processing
> triggers, how do I handle the mappings?  That's also non-trivial in my
> mind.
> 
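
One possible way to handle a container that starts after the devices
already exist is to walk the host /dev once at container start instead
of waiting for udev events. The sketch below assumes a per-container
allow-list of glob patterns, which is a hypothetical policy format, and
skips the /dev/lxc subtree so the walk does not see other containers'
nodes. It only covers device nodes; replaying the container's own udev
rules is still the open problem described above.

```python
import fnmatch
import os
import stat


def scan_devices(dev_root="/dev", skip=("lxc",)):
    """Yield (relative name, kind, major, minor) for device nodes under dev_root."""
    for dirpath, dirnames, filenames in os.walk(dev_root):
        if dirpath == dev_root:
            # Prune the per-container subtrees at the top level.
            dirnames[:] = [d for d in dirnames if d not in skip]
        for entry in filenames:
            path = os.path.join(dirpath, entry)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # node vanished between listing and lstat
            if stat.S_ISCHR(st.st_mode):
                kind = "c"
            elif stat.S_ISBLK(st.st_mode):
                kind = "b"
            else:
                continue  # symlinks, sockets, and regular files are skipped
            yield (os.path.relpath(path, dev_root), kind,
                   os.major(st.st_rdev), os.minor(st.st_rdev))


def select_devices(devices, patterns):
    """Keep only devices whose name matches one of the container's patterns."""
    return [d for d in devices
            if any(fnmatch.fnmatchcase(d[0], p) for p in patterns)]
```

A host tool could then feed select_devices(scan_devices(), patterns)
into whatever creates the nodes in the container's subtree.
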
> Device creation would seem to be pretty trivial.  Device removal, not
> so much.  If I create another node on devtmpfs and that major/minor
> gets removed, will it also get removed?  I also have to remove the
> symlinks.  The removal process just feels more complicated in my mind.
> 
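
A rough sketch of the removal side, under the assumption that a node
created by hand in the container's subtree is NOT cleaned up
automatically when the device goes away (the question asked above, which
this does not answer). The function names are hypothetical. The host
tool would find every node with the departing type and major/minor,
unlink it, then sweep up symlinks that no longer resolve.

```python
import os
import stat


def nodes_for(root, kind, major, minor):
    """Return paths under root that are 'c' or 'b' nodes with this major/minor."""
    is_kind = stat.S_ISCHR if kind == "c" else stat.S_ISBLK
    wanted = os.makedev(major, minor)
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for entry in files:
            path = os.path.join(dirpath, entry)
            st = os.lstat(path)
            if is_kind(st.st_mode) and st.st_rdev == wanted:
                matches.append(path)
    return matches


def prune_dangling_symlinks(root):
    """Unlink symlinks under root whose targets no longer exist; return them."""
    removed = []
    for dirpath, dirs, files in os.walk(root):
        # Symlinks to directories are reported in dirs, so check both lists.
        for entry in dirs + files:
            path = os.path.join(dirpath, entry)
            if os.path.islink(path) and not os.path.exists(path):
                os.unlink(path)
                removed.append(path)
    return removed


def remove_device(root, kind, major, minor):
    """Drop a departed device's nodes from one subtree, then its stale links."""
    for path in nodes_for(root, kind, major, minor):
        os.unlink(path)
    return prune_dangling_symlinks(root)
```
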
> Greg, I think you are absolutely right, this needs to be managed in
> user space and not in kernel space and we do have the tools to do it.
> I think I can do some of it in a way that will suck less compared to
> how we're (LXC is) doing it now.  I'm just not so sure how
> comprehensive the solution will be or how well it will work.
> 
> I've still got several other takeaways from that session to put a bow
> on before really testing this idea further.  I really have not fully
> fleshed this idea out and it's going to take me some time.  There may
> also be some other corner cases I haven't considered.  And then
> there's Android.  Sigh...
> 
> And maybe I'm just totally off base and crazy.  Wouldn't be the first
> time, won't be the last time.
> 
> > greg k-h
> 
> Regards,
> Mike
> --
> Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
>    /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
>    NIC whois: MHW9          | An optimist believes we live in the best of all
>  PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
> 
