[pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname
Fiona Ebner
f.ebner at proxmox.com
Tue Jun 11 15:25:27 CEST 2024
Do you mean "reservation" rather than "ref-counting"?
On 03.04.24 at 17:07, Hannes Duerr wrote:
> As Fabian has already mentioned here[0], there can be a race between two
> parallel imports. More specifically, if both imports have --allow-rename
> set and the desired name already exists, then it can happen that both
> imports get the same name. The reason for this is that we currently only
> check which names have already been assigned in the vm config and then
> use the next free one, and do not check whether a running process has
> already been assigned the same name. However, while writing and testing
> the patch, I found that this is often not a problem, as
> - in most cases the source VM config is locked and therefore only one
> process can generate a new name (migrate, clone, move disk, update
> config)
> - a name must be specified and therefore no new name is generated (pvesm
> alloc)
> - the timeframe for the race is very short.
>
> At the same time, it is possible that I have not considered all edge
> cases and that there are other cases where the race can occur,
> especially with regard to remote_migrations. You can provoke the problem
> with two parallel imports on a host where local-lvm:vm-100-disk-0
> already exists:
>
> pvesm import local-lvm:vm-100-disk-0 raw+size <exported_volume> --allow-rename 1
>
From a quick glance, it seems remote migration does one disk-import at a
time, i.e. only starts the next one once the previous one is finished.
In fact, even a malicious client could only do one import at a time,
because there is only one socket (per VM):
> $params->{unix} = "/run/qemu-server/$state->{vmid}.storage";
>
> return PVE::StorageTunnel::handle_disk_import($state, $params);
> Now that I've looked into the problem a bit, I'm not sure this patch is
> even necessary as it adds more complexity. So I wanted to ask for your
> opinion, whether you think it makes sense to add this change or not.
>
Please either use RFC for sending the patch then and/or put this
paragraph below the --- so it won't appear in the commit message.
IMHO, it would be nice to fix, but yes, it is a bunch of new code. I
think you could make it slightly simpler though, by doing the following
in the locked section (rough sketch after the list):
1. iterate over reservations
-> if stale, remove
-> if not stale, add to $disk_list
2. get the next free name using the larger $disk_list
3. print the updated list of reservations including the new name
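Rough, untested sketch of what I mean. read_reservations(),
write_reservations(), $expire_window and the { volname, time } entry
structure are made-up placeholders here, the middle part is the existing
logic of get_next_vm_diskname():

    my $code = sub {
	my $now = time();

	# 1. iterate over the reservations: drop stale ones, treat the
	#    still-active ones as taken by adding them to $disk_list
	my $reservations = read_reservations($reserved_names_file); # placeholder helper
	my $active = [];
	for my $res (@$reservations) {
	    next if $now - $res->{time} > $expire_window; # stale -> drop
	    push @$active, $res;
	    push @$disk_list, $res->{volname};
	}

	# 2. get the next free name using the enlarged $disk_list (existing logic)
	my $disk_ids = {};
	for my $disk (@$disk_list) {
	    my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
	    $disk_ids->{$disknum} = 1 if defined($disknum);
	}
	my $name;
	for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
	    if (!$disk_ids->{$i}) {
		$name = "$prefix-$vmid-disk-$i$suffix";
		last;
	    }
	}
	die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
	    if !defined($name);

	# 3. write out the updated reservation list including the new name
	push @$active, { volname => $name, time => $now };
	write_reservations($reserved_names_file, $active); # placeholder helper

	return $name;
    };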
> The patch introduces a tmp file which stores the newly assigned disk
> names and the pid of the process which requested the disk name. If a
> second process would otherwise be assigned the same name, it will see
> from the file that the name has already been assigned to another
> process and will take the next available name instead. Reading and
> writing to the tmp file requires a lock
> to prevent races.
>
> [0] https://lists.proxmox.com/pipermail/pve-devel/2024-January/061526.html
>
> Signed-off-by: Hannes Duerr <h.duerr at proxmox.com>
> ---
> src/PVE/Storage/Plugin.pm | 99 ++++++++++++++++++++++++++++++++++-----
> 1 file changed, 86 insertions(+), 13 deletions(-)
>
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 7456c8e..f76550a 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -10,8 +10,10 @@ use File::chdir;
> use File::Path;
> use File::Basename;
> use File::stat qw();
> +use File::Copy;
>
So this is used for the single move() below? Any reason not to use
file_set_contents() like in other places? You are using locking after
all, so you would not need the tmp file either.
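I.e. roughly (sketch only; @reservation_lines stands in for however the
final on-disk format of the entries ends up looking):

    # we already hold the lock here, so the file can simply be rewritten;
    # file_set_contents() itself writes to a temporary file and renames it
    # into place, so no manual tmp file + move() is needed
    PVE::Tools::file_set_contents(
	$reserved_names_file, join("\n", @reservation_lines) . "\n");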
> use PVE::Tools qw(run_command);
> +use PVE::ProcFSTools;
> use PVE::JSONSchema qw(get_standard_option register_standard_option);
> use PVE::Cluster qw(cfs_register_file);
>
> @@ -779,23 +781,94 @@ my $get_vm_disk_number = sub {
> sub get_next_vm_diskname {
> my ($disk_list, $storeid, $vmid, $fmt, $scfg, $add_fmt_suffix) = @_;
>
> - $fmt //= '';
> - my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> - my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> + my $code = sub {
> + my $reserved_names_file = "/var/tmp/pve-reserved-volnames";
> + my $tmp_file = "/var/tmp/pve-reserved-volnames.tmp";
>
> - my $disk_ids = {};
> - foreach my $disk (@$disk_list) {
> - my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> - $disk_ids->{$disknum} = 1 if defined($disknum);
> - }
> + $fmt //= '';
> + my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> + my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> + my $disk_id;
> + my $disk_ids = {};
>
> - for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> - if (!$disk_ids->{$i}) {
> - return "$prefix-$vmid-disk-$i$suffix";
> + foreach my $disk (@$disk_list) {
> + my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> + $disk_ids->{$disknum} = 1 if defined($disknum);
> }
> - }
>
> - die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
> + for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> + if (!$disk_ids->{$i}) {
> + $disk_id = $i;
> + last;
> + }
> + }
> +
> + if (! -e $reserved_names_file) {
> + my $create_h = IO::File->new($reserved_names_file, "w") ||
> + die "can't open or create'$reserved_names_file' - $!\n";
> + print $create_h "$storeid $vmid $disk_id $$";
Not sure about using PIDs here. PIDs are re-used, and couldn't this also
be the PID of some long-running daemon? Maybe we can use some expire
time like the next_unused_port() helper in PVE::Tools does? It can also
be a few minutes in this case, as we don't expect reservations to saturate.
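E.g. something like this (sketch only; the entry format and the 5 minute
window are just example values):

    # record a timestamp instead of the PID and expire old entries,
    # similar to the port reservations in PVE::Tools::next_unused_port()
    my $expire_after = 5 * 60; # arbitrary example value
    my $now = time();

    my @lines;
    if (-e $reserved_names_file) {
	@lines = split(/\n/, PVE::Tools::file_get_contents($reserved_names_file));
    }

    # keep only entries that have not expired yet
    my @active = grep {
	my (undef, undef, undef, $timestamp) = split(/\s+/, $_);
	defined($timestamp) && $now - $timestamp < $expire_after;
    } @lines;

    push @active, "$storeid $vmid $disk_id $now";
    PVE::Tools::file_set_contents($reserved_names_file, join("\n", @active) . "\n");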
[...]
> + my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);
The lock file is local to the node, so this won't work for shared
storages. For those, you would need to track the reservations in a file
on the cluster file system.
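Very rough sketch of what I mean for the shared case (the path is just a
placeholder, and I'm assuming cfs_lock_storage() behaves like the other
cfs_lock_* helpers, i.e. returns the code's result and reports errors via $@):

    use PVE::Cluster;

    # for shared storages, keep the reservations on the cluster file system
    # and serialize access with a cluster-wide lock instead of a local flock
    my $reserved_names_file = "/etc/pve/priv/reserved-volnames-$storeid"; # placeholder path

    my $new_disk_name = PVE::Cluster::cfs_lock_storage($storeid, 10, $code);
    die $@ if $@;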
> + die $@ if $@;
> +
> + if (!$new_disk_name) {
> + die "unable to allocate an image name for VM $vmid in storage '$storeid'\n";
> + }
> + return $new_disk_name;
> }
>
> sub find_free_diskname {