[pve-devel] [PATCH v2 container 1/1] Add device passthrough

Wolfgang Bumiller w.bumiller at proxmox.com
Fri Nov 3 09:14:56 CET 2023


On Thu, Nov 02, 2023 at 03:28:22PM +0100, Filip Schauer wrote:
> 
> On 30/10/2023 14:34, Wolfgang Bumiller wrote:
> > On Tue, Oct 24, 2023 at 02:55:53PM +0200, Filip Schauer wrote:
> > > Add a dev[n] argument to the container config to pass devices through to
> > > a container. A device can be passed by its path. Alternatively a mapped
> > > USB device can be passed through with usbmapping=<name>.
> > > 
> > > Signed-off-by: Filip Schauer<f.schauer at proxmox.com>
> > > ---
> > >   src/PVE/LXC.pm        | 34 +++++++++++++++++++++++-
> > >   src/PVE/LXC/Config.pm | 60 +++++++++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 93 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> > > index c9b5ba7..a3ddb62 100644
> > > --- a/src/PVE/LXC.pm
> > > +++ b/src/PVE/LXC.pm
> > > @@ -5,7 +5,8 @@ use warnings;
> > >   use Cwd qw();
> > >   use Errno qw(ELOOP ENOTDIR EROFS ECONNREFUSED EEXIST);
> > > -use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY);
> > > +use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY :mode);
> > > +use File::Basename;
> > >   use File::Path;
> > >   use File::Spec;
> > >   use IO::Poll qw(POLLIN POLLHUP);
> > > @@ -639,6 +640,37 @@ sub update_lxc_config {
> > >   	$raw .= "lxc.mount.auto = sys:mixed\n";
> > >       }
> > > +    # Clear passthrough directory from previous run
> > > +    my $passthrough_dir = "/var/lib/lxc/$vmid/passthrough";
> > > +    File::Path::rmtree($passthrough_dir);
> > I think we need to make a few changes here.
> > 
> > First: we don't necessarily need this directory.
> > Having a device list would certainly be nice, but it makes more sense to
> > just have a file we can easily parse (possibly even just a json hash),
> > like the `devices` file we already create in the pre-start hook, except
> > prepared *for* the pre-start hook, which *should* be able to just
> > `mknod` the devices right into the container's `/dev` on startup.
> 
> 
> Devices mknoded into the container's /dev directory in the pre-start
> hook will not be visible in the container once it is fully started.

Ah yes, I keep ignoring that.

> Meanwhile mknoding a device to a different path inside the container
> works fine. It seems that LXC mounts over the /dev directory. This can

/dev will be a tmpfs, yes.

> be solved by calling mknod in lxc-pve-autodev-hook, but this does not
> work with unprivileged containers without the mknod capability.
> 
> So are bind mounts our only option without modifying LXC,
> or am I overlooking something?

Sort of. We *could* still do this via a separate process we signal from
out of the autodev hook to do the work for it, but that'll make the
startup process even more convoluted.
And I think the seccomp proxying only starts after the entire init
setup, so we also can't just rely on syscalld (of which the entire
point is to do mknods for the container 🙄).

I'm also working on a seccomp wrapper to allow unprivileged restores of
backups to `mknod()` the basics, but that, too, happens via seccomp, so
not really reusable in this case either (and syscalld is not suitable
for *this* either (for now) as it uses an lxc specific protocol and does
not by itself perform the seccomp setup...)

Perhaps there's a way to unify all that (at least partially) by teaching
syscalld an additional protocol we can use in all 3 cases (although the
requirements are slightly different... here we only have "known"
paths & permissions, so we wouldn't need to deal with copying another
process' rootfs/chroot/fds/... to perform a syscall on their behalf,
which the other cases do need...)

So yeah, I suppose we can go the bind-mount route first, as it is
simpler, and then maybe change it later.

However, I still don't want to fill `/var/lib/lxc` on the host with
device nodes directly whenever we update the config via
`update_lxc_config()`.

So how about this:

In the prestart hook:
- mount a tmpfs to this path
- mknod the devices into it
And then in the autodev hook do the bind-mounting.
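
To make the split between the two hooks concrete, here is a rough sketch
that only *builds* the commands each hook would run, without executing
anything. The helper name `plan_passthrough`, the shape of the device
list and the example major/minor numbers are assumptions for
illustration, not existing code:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the command lists for both hooks as
# array refs. Nothing is executed here.
sub plan_passthrough {
    my ($vmid, $rootfs, $devices) = @_;
    my $dir = "/var/lib/lxc/$vmid/passthrough";

    # pre-start hook: private tmpfs, so no device nodes linger on disk
    my @prestart = (
        ['mkdir', '-p', $dir],
        ['mount', '-t', 'tmpfs', '-o', 'mode=0755', 'tmpfs', $dir],
    );
    # autodev hook: runs once the container's /dev tmpfs exists
    my @autodev;

    for my $dev (@$devices) {
        my ($path, $type, $major, $minor) = @$dev{qw(path type major minor)};
        push @prestart, ['mknod', '-m', '0660', "$dir/$path", $type, $major, $minor];
        # a bind mount needs an existing target file
        push @autodev, ['touch', "$rootfs/$path"];
        push @autodev, ['mount', '--bind', "$dir/$path", "$rootfs/$path"];
    }
    return (\@prestart, \@autodev);
}

# Example values only (not taken from the patch).
my ($pre, $auto) = plan_passthrough(100, '/rootfs', [
    { path => 'dev/ttyUSB0', type => 'c', major => 188, minor => 0 },
]);
print join(' ', @$_), "\n" for @$pre, @$auto;
```

Nested device paths would additionally need their parent directories
created on both sides, which is left out here.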

> 
> 
> > We'd also avoid "lingering" device nodes with potentially harmful
> > uid/permissions in /var, which is certainly better from a security POV.
> > 
> > But note that we do need the `lxc.cgroup2.*` entries before starting the
> > container in order to ensure the devices cgroup has the right
> > permissions.
> > 
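
For reference, a standalone sketch of how such an entry could be derived
from a `stat()` result. The helper names are made up for illustration,
and the major/minor decoding is written out by hand following the Linux
`dev_t` layout instead of calling `PVE::Tools`. Unlike the patch, it
rejects anything that is neither a block nor a character device rather
than defaulting to 'c':

```perl
use strict;
use warnings;
use Fcntl qw(:mode);

# Linux dev_t decoding (same bit layout as glibc's major()/minor()).
sub dev_major {
    my ($dev) = @_;
    return (($dev >> 8) & 0xfff) | (($dev >> 32) & ~0xfff);
}

sub dev_minor {
    my ($dev) = @_;
    return ($dev & 0xff) | (($dev >> 12) & ~0xff);
}

# Hypothetical helper: build the devices cgroup entry for one node.
sub cgroup_allow_line {
    my ($mode, $rdev) = @_;
    die "not a device node\n" unless S_ISBLK($mode) || S_ISCHR($mode);
    my $type = S_ISBLK($mode) ? 'b' : 'c';
    return sprintf("lxc.cgroup2.devices.allow = %s %d:%d rw\n",
        $type, dev_major($rdev), dev_minor($rdev));
}

# Usage: fields 2 and 6 of stat() are the mode and the rdev.
my ($mode, $rdev) = (stat('/dev/null'))[2, 6];
print cgroup_allow_line($mode, $rdev)
    if defined($mode) && (S_ISCHR($mode) || S_ISBLK($mode));
```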
> > > +
> > > +    PVE::LXC::Config->foreach_passthrough_device($conf, sub {
> > > +	my ($key, $sanitized_path) = @_;
> > > +
> > > +	my $absolute_path = "/$sanitized_path";
> > > +	my ($mode, $rdev) = (stat($absolute_path))[2, 6];
> > > +	die "Could not find major and minor ids of device $absolute_path.\n"
> > > +	    unless ($mode && $rdev);
> > > +
> > > +	my $major = PVE::Tools::dev_t_major($rdev);
> > > +	my $minor = PVE::Tools::dev_t_minor($rdev);
> > > +	my $device_type_char = S_ISBLK($mode) ? 'b' : 'c';
> > > +	my $passthrough_device_path = "$passthrough_dir/$sanitized_path";
> > > +	File::Path::make_path(dirname($passthrough_device_path));
> > > +	PVE::Tools::run_command([
> > > +	    '/usr/bin/mknod',
> > > +	    '-m', '0660',

Btw. with a property string used for the device entry, we could probably
also have an optional `mode` to use instead of `0660`, as well as a
`uid` and `gid` - but we'd need to map those with the container's id
mapping. Not sure if we already have helpers for that apart from getting
the root ids.
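
Roughly what I have in mind for the entry format, as a plain-Perl
sketch. The option names, the defaults and the helper name
`parse_dev_entry` are only a suggestion, and a real implementation would
go through the existing property string schema handling rather than a
hand-written parser:

```perl
use strict;
use warnings;

# Hypothetical format: "<path>[,mode=<octal>][,uid=<n>][,gid=<n>]"
sub parse_dev_entry {
    my ($value) = @_;
    my ($path, @opts) = split(/,/, $value);
    die "device path must be below /dev\n"
        if !defined($path) || $path !~ m!^/dev/!;

    # Defaults match the hard-coded values in the patch; uid/gid are
    # container-side ids and still need mapping to host ids.
    my $res = { path => $path, mode => '0660', uid => 0, gid => 0 };

    for my $opt (@opts) {
        if ($opt =~ /^mode=(0[0-7]{3})$/) {
            $res->{mode} = $1;
        } elsif ($opt =~ /^(uid|gid)=(\d+)$/) {
            $res->{$1} = int($2);
        } else {
            die "invalid option '$opt'\n";
        }
    }
    return $res;
}

my $dev = parse_dev_entry('/dev/ttyUSB0,mode=0600,uid=1000');
print "$dev->{path} mode=$dev->{mode} uid=$dev->{uid} gid=$dev->{gid}\n";
```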

> > > +	    $passthrough_device_path,
> > > +	    $device_type_char,
> > > +	    $major,
> > > +	    $minor
> > > +	]);
> > It's probably worth adding a helper for the mknod syscall to
> > `PVE::Tools`, there are a bunch of syscalls in there already.
> > 
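
A minimal sketch of what such a helper might look like, assuming x86_64
Linux only: the syscall number 133 is hard-coded for that architecture,
and the names `sys_mknod` and `makedev` are placeholders. A real helper
would take the number from wherever the other syscall numbers are
defined:

```perl
use strict;
use warnings;
use Fcntl qw(:mode);
use POSIX ();

# ASSUMPTION: mknod(2) syscall number on x86_64 Linux only.
my $SYS_mknod = 133;

# Encode major/minor into a Linux dev_t (same layout as glibc's makedev()).
sub makedev {
    my ($major, $minor) = @_;
    return (($major & 0xfff) << 8) | ($minor & 0xff)
        | (($minor & ~0xff) << 12) | (($major & ~0xfff) << 32);
}

# Create a node directly via the syscall instead of forking /usr/bin/mknod.
# $mode carries both the file type (S_IFCHR, S_IFBLK, ...) and permissions.
sub sys_mknod {
    my ($path, $mode, $dev) = @_;
    die "sys_mknod: unsupported architecture\n"
        if (POSIX::uname())[4] ne 'x86_64';
    my $p = "$path"; # syscall() needs a plain string for the pointer argument
    my $rc = syscall($SYS_mknod, $p, int($mode), int($dev));
    die "mknod '$path' failed: $!\n" if $rc != 0;
    return 1;
}

# Usage (needs root for device nodes), with example values:
# sys_mknod("$passthrough_dir/dev/ttyUSB0", S_IFCHR | 0660, makedev(188, 0));
```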
> > > +	chown 100000, 100000, $passthrough_device_path if ($unprivileged);
> > ^ This isn't necessarily the correct id. Users may have custom id
> > mappings.
> > `PVE::LXC::parse_id_maps($conf)` returns the mapping alongside the root
> > uid and gid. (See for example `sub mount_all` for how it's used.)
> > 
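
To illustrate why a fixed 100000 is not enough, here is a self-contained
sketch of translating a container id to a host id. It assumes each
mapping entry has the shape `[type, container_start, host_start, count]`
(mirroring `lxc.idmap = u 0 100000 65536`); the helper name
`map_ct_id_to_host` is hypothetical:

```perl
use strict;
use warnings;

# Translate a container-side uid ('u') or gid ('g') to the host id.
sub map_ct_id_to_host {
    my ($id_map, $type, $ct_id) = @_;
    for my $entry (@$id_map) {
        my ($t, $ct_start, $host_start, $count) = @$entry;
        next if $t ne $type;
        return $host_start + ($ct_id - $ct_start)
            if $ct_id >= $ct_start && $ct_id < $ct_start + $count;
    }
    die "container $type id $ct_id is not mapped\n";
}

# Example custom mapping: uid 1000 inside maps to uid 1000 on the host,
# everything else is shifted by 100000.
my $id_map = [
    ['u', 0,    100000, 1000],
    ['u', 1000, 1000,   1],
    ['u', 1001, 101001, 64535],
    ['g', 0,    100000, 65536],
];

printf "root uid on host: %d\n", map_ct_id_to_host($id_map, 'u', 0);
printf "uid 1000 on host: %d\n", map_ct_id_to_host($id_map, 'u', 1000);
```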
> > > +
> > > +	$raw .= "lxc.cgroup2.devices.allow = $device_type_char $major:$minor rw\n";
> > > +	$raw .= "lxc.mount.entry = $passthrough_device_path $sanitized_path none bind,create=file\n";
> > > +    });
> > > +
> > >       # WARNING: DO NOT REMOVE this without making sure that loop device nodes
> > >       # cannot be exposed to the container with r/w access (cgroup perms).
> > >       # When this is enabled mounts will still remain in the monitor's namespace




