[pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node

Fabian Grünbichler f.gruenbichler at proxmox.com
Tue Apr 1 11:52:36 CEST 2025


> Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> hat am 24.03.2025 12:15 CET geschrieben:
> verify that node is dead from corosync && ssh
> and move config file from /etc/pve directly

there are two reasons why this is dangerous and why we haven't exposed anything like this in the API or UI:

the first one is the risk of corruption - just because a (supposedly dead) node X is not reachable from your local node A doesn't mean it isn't still running. if it is still running, any guests that were started before might still be running as well, and any guests still running might still be able to talk to shared storage. unless there are other safeguards in place (like MMP, which is not a given for all storages), attempting to recover and start such a guest can easily corrupt its volumes completely. HA protects against this - node X will fence itself before node A attempts recovery, so there is never a situation where both nodes try to write to the same volume. just checking whether other cluster nodes can still connect to node X is not nearly enough to make this safe.
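to make that concrete: the check the patch below relies on boils down to something like the following (get_members/run_command/ping as in the patch, the node_looks_dead helper is just made up for illustration) - and passing it only proves that *this* node cannot reach node X, not that node X has stopped touching shared storage:

    use PVE::Cluster;
    use PVE::Tools;

    sub node_looks_dead {
        my ($node) = @_;
        my $members = PVE::Cluster::get_members();
        my $entry = $members->{$node};
        return 0 if $entry && $entry->{online}; # corosync still sees it -> clearly alive
        return 1 if !$entry || !$entry->{ip};   # nothing left to probe from here
        my $reachable = eval {
            PVE::Tools::run_command(['/usr/bin/ping', '-c', '1', $entry->{ip}]);
            1;
        };
        # "unreachable from node A" != "dead": a partitioned node X can still be
        # up, running its guests and writing to the same shared storage.
        return !$reachable;
    }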

the second one is ownership of a VM/CT - PVE relies on node-local locking of guests to avoid contention. this only works because each guest/VMID has a clear owner - the node the config currently resides on. if you steal a config by moving it, you are violating this assumption. we only change the owner of a VMID in two scenarios, each with careful consideration of the implications:
- when doing a migration, which is initiated by the source node that currently owns the guest, so it willingly hands over control to the new node - safe by definition (no stealing involved and proper locking in place)
- when doing an HA recovery, which is protected by the HA locks and the watchdog - we know that the original node has been fenced before the recovery happens, and we know it cannot do anything with the guest before it has been informed about the recovery (this is ensured by the design of the HA locks).
your code below is not protected by the HA stack, so there is a race involved - the node where the "deadnode migration" is initiated cannot lock the VMID in a way that the supposedly "dead" node knows about (config locking for guests is node-local, so it can only happen on the node that "owns" the config; anything else doesn't make sense and doesn't protect anything). if the "dead" node rejoins the cluster at the right moment, it still owns the VMID/config and can start it, while the other node thinks it can still steal it.

there's also no protection against initiating multiple deadnode migrations in parallel for the same VMID, although of course all but one will fail, because pmxcfs ensures the VMID.conf only exists under a single node. we'd need to give up node-local guest locking to close this gap, which is a no-go for performance reasons.
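to illustrate why node-local locking cannot close this gap: per-VMID guest locking is essentially a flock on a file that only exists on the owning node (lock path and helper name here are made up for illustration, not the actual implementation):

    use Fcntl qw(:flock);

    sub lock_vmid_on_this_node {
        my ($vmid, $code) = @_;
        # this file lives on the local disk of *this* node only
        my $lockfile = "/var/lock/qemu-server/lock-$vmid.conf";
        open(my $fh, '>>', $lockfile) or die "cannot open $lockfile - $!\n";
        flock($fh, LOCK_EX | LOCK_NB) or die "VMID $vmid is locked on this node\n";
        my $res = eval { $code->() };
        my $err = $@;
        close($fh); # drops the flock
        die $err if $err;
        return $res;
    }

a lock like that taken on node A is invisible to node X - if X comes back at the wrong moment it still owns the config and can happily start the guest, no matter what A holds locally.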

I understand that this would be convenient to expose, but it is also really dangerous for anyone who doesn't understand the implications - and once there is an option to trigger it via the UI, no matter how many disclaimers you put on it, people will press that button, mess up and blame PVE. at the same time, there already is an implementation that does this safely - it's called HA ;) so I'd rather spend the time improving the robustness of our HA stack than adding such a footgun.

> 
> Signed-off-by: Alexandre Derumier <alexandre.derumier at groupe-cyllene.com>
> ---
>  PVE/API2/Qemu.pm | 56 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 52 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
> index 156b1c7b..58c454a6 100644
> --- a/PVE/API2/Qemu.pm
> +++ b/PVE/API2/Qemu.pm
> @@ -4764,6 +4764,9 @@ __PACKAGE__->register_method({
>  		description => "Target node.",
>  		completion =>  \&PVE::Cluster::complete_migration_target,
>              }),
> +            deadnode => get_standard_option('pve-node', {
> +                description => "Dead source node.",
> +            }),
>  	    online => {
>  		type => 'boolean',
>  		description => "Use online/live migration if VM is running. Ignored if VM is stopped.",
> @@ -4813,8 +4816,9 @@ __PACKAGE__->register_method({
>  	my $authuser = $rpcenv->get_user();
>  
>  	my $target = extract_param($param, 'target');
> +	my $deadnode = extract_param($param, 'deadnode');
>  
> -	my $localnode = PVE::INotify::nodename();
> +	my $localnode = $deadnode ? $deadnode : PVE::INotify::nodename();
>  	raise_param_exc({ target => "target is local node."}) if $target eq $localnode;
>  
>  	PVE::Cluster::check_cfs_quorum();
> @@ -4835,14 +4839,43 @@ __PACKAGE__->register_method({
>  	raise_param_exc({ migration_network => "Only root may use this option." })
>  	    if $param->{migration_network} && $authuser ne 'root at pam';
>  
> +	raise_param_exc({ deadnode => "Only root may use this option." })
> +	    if $param->{deadnode} && $authuser ne 'root at pam';
> +
>  	# test if VM exists
> -	my $conf = PVE::QemuConfig->load_config($vmid);
> +	my $conf = $deadnode ? PVE::QemuConfig->load_config($vmid, $deadnode) : PVE::QemuConfig->load_config($vmid);
>  
>  	# try to detect errors early
>  
>  	PVE::QemuConfig->check_lock($conf);
>  
> -	if (PVE::QemuServer::check_running($vmid)) {
> +        if ($deadnode) {
> +	    die "Can't do online migration of a dead node.\n" if $param->{online};
> +	    my $members = PVE::Cluster::get_members();
> +	    die "The deadnode $deadnode seem to be alive" if $members->{$deadnode} && $members->{$deadnode}->{online};
> +
> +	    print "test if deadnode $deadnode respond to ping\n";
> +	    eval {
> +		PVE::Tools::run_command("/usr/bin/ping -c 1 $members->{$deadnode}->{ip}");
> +	    };
> +	    if(!$@){
> +		die "error: ping to target $deadnode is still working. Node seem to be alive.";
> +	    }
> +
> +	    #make an extra ssh connection to double check that it's not just a corosync crash
> +	    my $sshinfo = PVE::SSHInfo::get_ssh_info($deadnode);
> +	    my $sshcmd = PVE::SSHInfo::ssh_info_to_command($sshinfo);
> +	    push @$sshcmd, 'hostname';
> +	    print "test if deadnode $deadnode respond to ssh\n";
> +	    eval {
> +		PVE::Tools::run_command($sshcmd, timeout => 1);
> +	    };
> +	    if(!$@){
> +		die "error: ssh connection to target $deadnode is still working. Node seem to be alive.";
> +	    }
> +
> +
> +	} elsif (PVE::QemuServer::check_running($vmid)) {
>  	    die "can't migrate running VM without --online\n" if !$param->{online};
>  
>  	    my $repl_conf = PVE::ReplicationConfig->new();
> @@ -4881,7 +4914,22 @@ __PACKAGE__->register_method({
>  	    PVE::QemuServer::check_storage_availability($storecfg, $conf, $target);
>  	}
>  
> -	if (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
> +	if ($deadnode) {
> +	    my $realcmd = sub {
> +		my $config_fn = PVE::QemuConfig->config_file($vmid, $deadnode);
> +		my $new_config_fn = PVE::QemuConfig->config_file($vmid, $target);
> +
> +		rename($config_fn, $new_config_fn)
> +		    or die "failed to move config file to node '$target': $!\n";
> +	    };
> +
> +	    my $worker = sub {
> +		return PVE::GuestHelpers::guest_migration_lock($vmid, 10, $realcmd);
> +	    };
> +
> +	    return $rpcenv->fork_worker('qmigrate', $vmid, $authuser, $worker);
> +
> +        } elsif (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
>  
>  	    my $hacmd = sub {
>  		my $upid = shift;
> -- 
> 2.39.5



