[pve-devel] [PATCH RFC container] Add device passthrough

Wolfgang Bumiller w.bumiller at proxmox.com
Fri Oct 20 09:08:03 CEST 2023


On Thu, Oct 19, 2023 at 02:18:56PM +0200, Filip Schauer wrote:
> Signed-off-by: Filip Schauer <f.schauer at proxmox.com>
> ---
> Is it reasonable to add a "dev[n]" argument to the pct.conf, given that
> device mount points only allow passing through block devices?

Why would they only allow block devices?

Also, Dominik recently added resource mappings for qemu for USB & PCI.
PCI might be tricky, but for USB we may be able to use these mappings as
well.
That said, "raw" `/dev` node pass-through still makes sense as a
separate thing for containers anyway since raw `lxc....` entries in the
container config can currently be very inconvenient to deal with
particularly with unprivileged containers (read on below for why...)

> 
>  src/PVE/LXC.pm        | 14 ++++++++++++++
>  src/PVE/LXC/Config.pm | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> index c9b5ba7..6090534 100644
> --- a/src/PVE/LXC.pm
> +++ b/src/PVE/LXC.pm
> @@ -639,6 +639,20 @@ sub update_lxc_config {
>  	$raw .= "lxc.mount.auto = sys:mixed\n";
>      }
>  
> +    foreach my $k (keys %$conf) {
> +	next if $k !~ m/^dev(\d+)$/;

We should add a helper `sub foreach_passthrough_device()` taking a
sub which gets `$key, $raw_value, $sanitized_path`.

The last one would currently be the `substr($path, 1)`. Having to do
this later on just makes it easier to mess this up.

Also, seeing this now this reminds me that - but not as part of this
series - we should add a lot of `foreach_*` subs for this sort of
iteration like we have in qemu-server and replace such existing
constructs if possible...

> +	my $devpath = $conf->{$k};
> +	die "Device $devpath does not exist\n" unless (-e $devpath);
> +
> +	my ($mode, $rdev) = (stat($devpath))[2, 6];
> +	die "Could not find major and minor ids of device $devpath.\n" unless ($mode && $rdev);
> +
> +	my $major = PVE::Tools::dev_t_major($rdev);
> +	my $minor = PVE::Tools::dev_t_minor($rdev);
> +	$raw .= "lxc.cgroup2.devices.allow = c $major:$minor rw\n";

^ the 'c' isn't necessarily correct.
You can use `S_ISBLK($mode) ? 'b' : 'c'`. `S_ISBLK` comes from the `Fcntl` package.
(You can find more info on file modes and S_ISBLK in `man 7 inode`)

> +	$raw .= "lxc.mount.entry = $devpath " . substr($devpath, 1) . " none bind,create=file\n";

^ See above - the substr() there stands out quite weird :-)

> +    }
> +
>      # WARNING: DO NOT REMOVE this without making sure that loop device nodes
>      # cannot be exposed to the container with r/w access (cgroup perms).
>      # When this is enabled mounts will still remain in the monitor's namespace
> diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm
> index 56e1f10..4665ab1 100644
> --- a/src/PVE/LXC/Config.pm
> +++ b/src/PVE/LXC/Config.pm
> @@ -29,6 +29,7 @@ mkdir $lockdir;
>  mkdir "/etc/pve/nodes/$nodename/lxc";
>  my $MAX_MOUNT_POINTS = 256;
>  my $MAX_UNUSED_DISKS = $MAX_MOUNT_POINTS;
> +my $MAX_DEVICES = 256;
>  
>  # BEGIN implemented abstract methods from PVE::AbstractConfig
>  
> @@ -908,6 +909,37 @@ for (my $i = 0; $i < $MAX_UNUSED_DISKS; $i++) {
>      }
>  }
>  
> +PVE::JSONSchema::register_format('pve-lxc-dev-string', \&verify_lxc_dev_string);
> +sub verify_lxc_dev_string {
> +    my ($dev, $noerr) = @_;
> +
> +    if ($dev !~ m!^/dev/!) {
> +	return undef if $noerr;
> +	die "$dev does not start with /dev/\n";

This is where it gets interesting :-)

I can't find it now, but I faintly recall that at some point I had
suggested collecting devices in a different path than `/dev` to create
device nodes with the correct owner/group/permissions for unprivileged
containers to people on the forum.

See, we don't just need the device to be visible in the container, but
we also need it to be able to give the container's root user - which for
unprivileged containers is usually uid 100000 - access to it.

It *may* be fine for *some* devices to just `chown()` them in `/dev`,
but we generally don't want to reown devices on the host itself, so it
may make more sense to `mknod()` the same device somewhere in
`/var/lib/lxc/$vmid/passthrough` and `chown()` it there. (One
alternative would of course be ACL entries on the host, but IMO having a
directory specific for a container where you can see all its accessible
devics has its advantages...)

(With newer kernels we might be able to instead use an idmapped bind
mount, though, but pve-8's initial kernel does not support idmapping on
tmpfs yet :-/ (that starts at 6.3 IIRC))

We should also think about whether we'd like to support the containers
calling `mknod()` to "create" (but not really) those devices themselves.
for privileged containers this is easy enough, privileged means
privileged, so they can just do whatever.
For unprivileged containers, however, `mknod()` does not work at all,
and even if it did, we cannot allow it directly, since we have no
guarantee that after the next reboot the major/minor numbers will be the
same after the next reboot. So if we do want to support this, we'd need
to teach `pve-lxc-syscalld` to know about allowed devices and give it
the ability to hot-bind-mount such nodes (rather than actually
mknod()ing them).

> +    }
> +
> +    return $dev;
> +}
> +
> +my $dev_desc = {
> +    dev => {
> +	type => 'string',
> +	default_key => 1,
> +	format => 'pve-lxc-dev-string',
> +	format_description => 'Path',
> +	description => 'Device to pass through to the container',
> +	verbose_description => 'Path to the device to pass through to the container'
> +    }
> +};
> +
> +for (my $i = 0; $i < $MAX_DEVICES; $i++) {
> +    $confdesc->{"dev$i"} = {
> +	optional => 1,
> +	type => 'string', format => $dev_desc,
> +	description => "Device to pass through to the container",
> +    }
> +}
> +
>  sub parse_pct_config {
>      my ($filename, $raw, $strict) = @_;
>  
> -- 
> 2.39.2





More information about the pve-devel mailing list