[pve-devel] [PATCH qemu-server v3 3/5] await and kill lingering KVM thread when VM start reaches timeout

Fabian Grünbichler f.gruenbichler at proxmox.com
Wed Dec 21 12:14:41 CET 2022


On December 16, 2022 2:36 pm, Daniel Tschlatscher wrote:
> In some cases the VM API start method would return before the detached
> KVM process would have exited. This is especially problematic with HA,
> because the HA manager would think the VM started successfully, later
> see that it exited and start it again in an endless loop.
> 
> Moreover, another case exists when resuming a hibernated VM. In this
> case, the qemu thread will attempt to load the whole vmstate into
> memory before exiting.
> Depending on vmstate size, disk read speed, and similar factors this
> can take quite a while though and it is not possible to start the VM
> normally during this time.
> 
> To get around this, this patch intercepts the error, looks whether a
> corresponding KVM thread is still running, and waits for/kills it,
> before continuing.
> 
> Signed-off-by: Daniel Tschlatscher <d.tschlatscher at proxmox.com>
> ---
> 
> Changes from v2:
> * Rebased to current master
> * Changed warn to use 'log_warn' instead
> * Reworded log message when waiting for lingering qemu process
> 
>  PVE/QemuServer.pm | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 2adbe3a..f63dc3f 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -5884,15 +5884,41 @@ sub vm_start_nolock {
>  		$tpmpid = start_swtpm($storecfg, $vmid, $tpm, $migratedfrom);
>  	    }
>  
> -	    my $exitcode = run_command($cmd, %run_params);
> -	    if ($exitcode) {
> -		if ($tpmpid) {
> -		    warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
> -		    kill 'TERM', $tpmpid;
> +	    eval {
> +		my $exitcode = run_command($cmd, %run_params);
> +
> +		if ($exitcode) {
> +		    if ($tpmpid) {
> +			log_warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";

this warn -> log_warn change seems to have slipped in; it's not really part of
this patch?

> +			kill 'TERM', $tpmpid;
> +		    }
> +		    die "QEMU exited with code $exitcode\n";
>  		}
> -		die "QEMU exited with code $exitcode\n";
> +	    };
> +
> +	    if (my $err = $@) {
> +		my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
> +
> +		if ($pid ne "") {

the declaration and the check can be combined:
if (my $pid = ...) {

}

(an empty string evaluates to false in Perl ;))
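
To illustrate the combined form, here is a minimal self-contained sketch. `lookup_pid` and `describe` are hypothetical names used only for this example; `lookup_pid` stands in for PVE::QemuServer::Helpers::vm_running_locally and is assumed to return a PID or an empty string:

```perl
use strict;
use warnings;

# Hypothetical stand-in for vm_running_locally(): returns a PID for a
# "running" VM and an empty string otherwise (an assumption of this sketch).
sub lookup_pid {
    my ($vmid) = @_;
    my %running = (100 => 4242);
    return $running{$vmid} // "";
}

sub describe {
    my ($vmid) = @_;
    # Declaration and truth test in one step; "" (like undef and 0) is false,
    # so no explicit comparison against "" is needed.
    if (my $pid = lookup_pid($vmid)) {
        return "VM $vmid still running with pid $pid";
    }
    return "VM $vmid not running";
}

print describe(100), "\n";
print describe(101), "\n";
```

$pid is scoped to the if block, which also keeps it from leaking into the rest of the sub.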

> +		    my $count = 0;
> +		    my $timeout = 300;
> +
> +		    print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
> +		    while (($count < $timeout) &&
> +			PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
> +			$count++;
> +			sleep(1);
> +		    }
> +

either here

> +		    if ($count >= $timeout) {
> +			log_warn "Reached timeout. Terminating now with SIGKILL\n";

or here, recheck that the VM is still running and still has the same PID, and
if not, log accordingly instead of sending SIGKILL.

the same applies to _do_vm_stop
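
A rough sketch of what such a recheck could look like, not a drop-in replacement for the hunk above. `wait_then_kill` is a hypothetical name, and the three callbacks are assumptions made so the logic can be run without real processes: `$get_pid` stands in for vm_running_locally($vmid) (assumed to return the current PID or an empty string), while `$kill` and `$sleep` stand in for the kill and sleep builtins:

```perl
use strict;
use warnings;

# Sketch only: all three callbacks are injected for illustration.
sub wait_then_kill {
    my ($pid, $timeout, $get_pid, $kill, $sleep) = @_;

    # Poll once per "second" until the process is gone or the timeout hits.
    my $count = 0;
    while ($count < $timeout && $get_pid->()) {
        $count++;
        $sleep->(1);
    }

    # Recheck right before signalling: the process may have exited in the
    # meantime, or the VM may have been started again under a new PID.
    my $current = $get_pid->();
    if (!$current) {
        return "exited";
    } elsif ($current != $pid) {
        return "pid changed";
    }

    $kill->(9, $pid);
    return "killed";
}

# Example: the process never exits, so it is killed after the timeout.
my @killed;
my $res = wait_then_kill(4242, 3, sub { 4242 }, sub { push @killed, $_[1] }, sub { });
print "$res\n";
```

This only narrows the window between the check and the signal; it does not eliminate it. In the real code, the "exited" and "pid changed" outcomes would be logged instead of returned.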

> +			kill(9, $pid);
> +		    }
> +		}
> +
> +		die $err;
>  	    }
> -	};
> +	}
>      };
>  
>      if ($conf->{hugepages}) {
> -- 
> 2.30.2




