[pve-devel] [PATCH v2 container 1/1] Add device passthrough

Tue Nov 7 14:49:22 CET 2023

Patch v3 available:

https://lists.proxmox.com/pipermail/pve-devel/2023-November/059844.html

On 03/11/2023 09:14, Wolfgang Bumiller wrote:
> On Thu, Nov 02, 2023 at 03:28:22PM +0100, Filip Schauer wrote:
>> On 30/10/2023 14:34, Wolfgang Bumiller wrote:
>>> On Tue, Oct 24, 2023 at 02:55:53PM +0200, Filip Schauer wrote:
>>>> Add a dev[n] argument to the container config to pass devices through to
>>>> a container. A device can be passed by its path. Alternatively a mapped
>>>> USB device can be passed through with usbmapping=<name>.
>>>>
>>>> Signed-off-by: Filip Schauer <f.schauer at proxmox.com>
>>>> ---
>>>>    src/PVE/LXC.pm        | 34 +++++++++++++++++++++++-
>>>>    src/PVE/LXC/Config.pm | 60 +++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 93 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
>>>> index c9b5ba7..a3ddb62 100644
>>>> --- a/src/PVE/LXC.pm
>>>> +++ b/src/PVE/LXC.pm
>>>> @@ -5,7 +5,8 @@ use warnings;
>>>>    use Cwd qw();
>>>>    use Errno qw(ELOOP ENOTDIR EROFS ECONNREFUSED EEXIST);
>>>> -use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY);
>>>> +use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY :mode);
>>>> +use File::Basename;
>>>>    use File::Path;
>>>>    use File::Spec;
>>>>    use IO::Poll qw(POLLIN POLLHUP);
>>>> @@ -639,6 +640,37 @@ sub update_lxc_config {
>>>>    	$raw .= "lxc.mount.auto = sys:mixed\n";
>>>>        }
>>>> +    # Clear passthrough directory from previous run
>>>> +    my $passthrough_dir = "/var/lib/lxc/$vmid/passthrough";
>>>> +    File::Path::rmtree($passthrough_dir);
>>> I think we need to make a few changes here.
>>>
>>> First: we don't necessarily need this directory.
>>> Having a device list would certainly be nice, but it makes more sense to
>>> just have a file we can easily parse (possibly even just a json hash),
>>> like the `devices` file we already create in the pre-start hook, except
>>> prepared *for* the pre-start hook, which *should* be able to just
>>> `mknod` the devices right into the container's `/dev` on startup.
>>
>> Devices mknoded into the container's /dev directory in the pre-start
>> hook will not be visible in the container once it is fully started.
> Ah yes, I keep ignoring that.
>
>> Meanwhile mknoding a device to a different path inside the container
>> works fine. It seems that LXC mounts over the /dev directory. This can
> /dev will be a tmpfs, yes.
>
>> be solved by calling mknod in lxc-pve-autodev-hook, but this does not
>> work with unprivileged containers without the mknod capability.
>>
>> So are bind mounts our only option without modifying LXC,
>> or am I overlooking something?
> Sort of. We *could* still do this via a separate process we signal from
> out of the autodev hook to do the work for it, but that'll make the
> startup process even more convoluted.
> And I think the seccomp proxying only starts after the entire init
> setup, so we also can't just rely on syscalld (of which the entire
> point is to do mknods for the container 🙄).
>
> I'm also working on a seccomp wrapper to allow unprivileged restores of
> backups to `mknod()` the basics, but that, too, happens via seccomp, so
> not really reusable in this case either (and syscalld is not suitable
> for *this* either (for now) as it uses an lxc specific protocol and does
> not by itself perform the seccomp setup...)
>
> Perhaps there's a way to unify all that (at least partially) by teaching
> syscalld an additional protocol we can use in all 3 cases (although the
> requirements are slightly different... here we only have "known"
> paths & permissions, so we wouldn't need to deal with copying another
> process' rootfs/chroot/fds/... to perform a syscall on their behalf,
> which the other cases do need...)
>
> So yeah, I suppose we can go the bind-mount route first, as it is
> simpler, and then maybe change it later.
>
> However, I still don't want to fill `/var/lib/lxc` on the host with
> device nodes directly whenever we update the config via
> `update_lxc_config()`.
>
> So how about this:
>
> In the prestart hook:
> - mount a tmpfs to this path
> - mknod the devices into it
> And then in the autodev hook do the bind-mounting.
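The split proposed above could look roughly like the following. This is only a sketch: `plan_passthrough_commands` is a hypothetical helper, `$rootfs` stands in for wherever the autodev hook sees the container's root, and the function just returns the argv lists each hook would run instead of executing them. The passthrough path and the `0660` mode are taken from the patch.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper: build the command lists for both hooks without
# running anything. $devices is a list of hashes with the keys path
# (relative, e.g. "dev/ttyUSB0"), type ('b' or 'c'), major and minor.
sub plan_passthrough_commands {
    my ($vmid, $rootfs, $devices) = @_;

    my $dir = "/var/lib/lxc/$vmid/passthrough";

    # pre-start hook: tmpfs first, so no device nodes land on the host disk
    my @prestart = (
        ['mkdir', '-p', $dir],
        ['mount', '-t', 'tmpfs', 'tmpfs', $dir],
    );
    my @autodev;

    for my $dev (@$devices) {
        my $node = "$dir/$dev->{path}";
        push @prestart, ['mkdir', '-p', dirname($node)];
        push @prestart, ['mknod', '-m', '0660', $node, $dev->{type}, $dev->{major}, $dev->{minor}];

        # autodev hook: a bind mount needs an existing target file
        my $target = "$rootfs/$dev->{path}";
        push @autodev, ['touch', $target];
        push @autodev, ['mount', '--bind', $node, $target];
    }

    return (\@prestart, \@autodev);
}

my ($prestart, $autodev) = plan_passthrough_commands(
    100, '/path/to/rootfs',
    [{ path => 'dev/ttyUSB0', type => 'c', major => 188, minor => 0 }],
);
print join(' ', @$_), "\n" for @$prestart, @$autodev;
```

Since the nodes live on a tmpfs, nothing lingers under `/var/lib/lxc` once the mount is gone.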
>
>>
>>> We'd also avoid "lingering" device nodes with potentially harmful
>>> uid/permissions in /var, which is certainly better from a security POV.
>>>
>>> But note that we do need the `lxc.cgroup2.*` entries before starting the
>>> container in order to ensure the devices cgroup has the right
>>> permissions.
>>>
>>>> +
>>>> +    PVE::LXC::Config->foreach_passthrough_device($conf, sub {
>>>> +	my ($key, $sanitized_path) = @_;
>>>> +
>>>> +	my $absolute_path = "/$sanitized_path";
>>>> +	my ($mode, $rdev) = (stat($absolute_path))[2, 6];
>>>> +	die "Could not find major and minor ids of device $absolute_path.\n"
>>>> +	    unless ($mode && $rdev);
>>>> +
>>>> +	my $major = PVE::Tools::dev_t_major($rdev);
>>>> +	my $minor = PVE::Tools::dev_t_minor($rdev);
>>>> +	my $device_type_char = S_ISBLK($mode) ? 'b' : 'c';
>>>> +	my $passthrough_device_path = "$passthrough_dir/$sanitized_path";
>>>> +	File::Path::make_path(dirname($passthrough_device_path));
>>>> +	PVE::Tools::run_command([
>>>> +	    '/usr/bin/mknod',
>>>> +	    '-m', '0660',
> Btw. with a property string used for the device entry, we could probably
> also have an optional `mode` to use instead of `0660`, as well as a
> `uid` and `gid` - but we'd need to map those with the container's id
> mapping. Not sure if we already have helpers for that apart from getting
> the root ids.
>
>>>> +	    $passthrough_device_path,
>>>> +	    $device_type_char,
>>>> +	    $major,
>>>> +	    $minor
>>>> +	]);
>>> It's probably worth adding a helper for the mknod syscall to
>>> `PVE::Tools`, there are a bunch of syscalls in there already.
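A rough idea of what such a helper might look like, not how `PVE::Tools` does or would do it. Both function names are hypothetical, the syscall number 133 is an assumption that only holds on x86_64, and the encoding covers majors below 4096 and minors below 2^20.

```perl
use strict;
use warnings;
use Fcntl qw(:mode);

# Pack major/minor into the 32-bit device number the Linux mknod syscall
# takes (valid for major < 4096 and minor < 2**20).
sub encode_dev {
    my ($major, $minor) = @_;
    return (($major & 0xfff) << 8) | ($minor & 0xff) | (($minor & 0xfff00) << 12);
}

# Hypothetical wrapper around the raw syscall. Returns true on success;
# on failure $! holds the error. ASSUMPTION: 133 is SYS_mknod on x86_64
# only, other architectures use different numbers (or only mknodat).
sub mknod_syscall {
    my ($path, $mode, $dev) = @_;
    my $SYS_mknod = 133;
    return syscall($SYS_mknod, $path, int($mode), int($dev)) == 0;
}

# Example values only: a character device with major 188, minor 0.
my $dev = encode_dev(188, 0);
printf "mode=%o dev=%d\n", S_IFCHR | 0660, $dev;
# With sufficient privileges this would create the node:
# mknod_syscall('/some/path', S_IFCHR | 0660, $dev) or die "mknod failed: $!\n";
```

Compared to spawning `/usr/bin/mknod`, this avoids a fork/exec per device and gives direct access to the errno.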
>>>
>>>> +	chown 100000, 100000, $passthrough_device_path if ($unprivileged);
>>> ^ This isn't necessarily the correct id. Users may have custom id
>>> mappings.
>>> `PVE::LXC::parse_id_maps($conf)` returns the mapping alongside the root
>>> uid and gid. (See for example `sub mount_all` for how it's used.)
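To illustrate why a fixed 100000 breaks with custom mappings, here is a standalone sketch of translating a container id to the host id. `map_ct_id_to_host` is a hypothetical helper, and the entry layout `[type, ct_start, host_start, length]` is an assumption about what the mapping looks like, not a statement about the actual return value of `PVE::LXC::parse_id_maps`.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a container uid/gid into the host id.
# ASSUMPTION: each entry is [type, ct_start, host_start, length] with
# type 'u' or 'g'. Returns nothing if the id is not mapped.
sub map_ct_id_to_host {
    my ($id_map, $type, $ct_id) = @_;
    for my $entry (@$id_map) {
        my ($t, $ct, $host, $length) = @$entry;
        next if $t ne $type;
        return $host + ($ct_id - $ct) if $ct_id >= $ct && $ct_id < $ct + $length;
    }
    return;
}

# Example custom mapping: container uid 1000 is passed through to host
# uid 1000, everything else is shifted by 100000.
my $id_map = [
    ['u', 0,    100000, 1000],
    ['u', 1000, 1000,   1],
    ['u', 1001, 101001, 64535],
    ['g', 0,    100000, 65536],
];

for my $uid (0, 1000, 1001) {
    print "ct uid $uid -> host uid ", map_ct_id_to_host($id_map, 'u', $uid), "\n";
}
```

For the `chown` in the patch only the mapped root ids are needed, but a lookup like this would also cover an optional per-device `uid`/`gid`.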
>>>
>>>> +
>>>> +	$raw .= "lxc.cgroup2.devices.allow = $device_type_char $major:$minor rw\n";
>>>> +	$raw .= "lxc.mount.entry = $passthrough_device_path $sanitized_path none bind,create=file\n";
>>>> +    });
>>>> +
>>>>        # WARNING: DO NOT REMOVE this without making sure that loop device nodes
>>>>        # cannot be exposed to the container with r/w access (cgroup perms).
>>>>        # When this is enabled mounts will still remain in the monitor's namespace