[pve-devel] [PATCH qemu-server v3 3/5] await and kill lingering KVM thread when VM start reaches timeout
Fabian Grünbichler
f.gruenbichler at proxmox.com
Wed Dec 21 12:14:41 CET 2022
On December 16, 2022 2:36 pm, Daniel Tschlatscher wrote:
> In some cases the VM API start method would return before the detached
> KVM process would have exited. This is especially problematic with HA,
> because the HA manager would think the VM started successfully, later
> see that it exited and start it again in an endless loop.
>
> Moreover, another case exists when resuming a hibernated VM. In this
> case, the qemu thread will attempt to load the whole vmstate into
> memory before exiting.
> Depending on vmstate size, disk read speed, and similar factors this
> can take quite a while though and it is not possible to start the VM
> normally during this time.
>
> To get around this, this patch intercepts the error, looks whether a
> corresponding KVM thread is still running, and waits for/kills it,
> before continuing.
>
> Signed-off-by: Daniel Tschlatscher <d.tschlatscher at proxmox.com>
> ---
>
> Changes from v2:
> * Rebased to current master
> * Changed warn to use 'log_warn' instead
> * Reworded log message when waiting for lingering qemu process
>
> PVE/QemuServer.pm | 40 +++++++++++++++++++++++++++++++++-------
> 1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 2adbe3a..f63dc3f 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -5884,15 +5884,41 @@ sub vm_start_nolock {
> $tpmpid = start_swtpm($storecfg, $vmid, $tpm, $migratedfrom);
> }
>
> - my $exitcode = run_command($cmd, %run_params);
> - if ($exitcode) {
> - if ($tpmpid) {
> - warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
> - kill 'TERM', $tpmpid;
> + eval {
> + my $exitcode = run_command($cmd, %run_params);
> +
> + if ($exitcode) {
> + if ($tpmpid) {
> + log_warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
this warn -> log_warn change seems to have slipped in, it's not really part of
this patch?
> + kill 'TERM', $tpmpid;
> + }
> + die "QEMU exited with code $exitcode\n";
> }
> - die "QEMU exited with code $exitcode\n";
> + };
> +
> + if (my $err = $@) {
> + my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
> +
> + if ($pid ne "") {
the declaration and the check can be combined:
if (my $pid = ...) {
}
(an empty string evaluates to false in Perl ;))
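
A minimal sketch of the combined form, runnable on its own. The local vm_running_locally() below is a hypothetical stub standing in for PVE::QemuServer::Helpers::vm_running_locally(), assumed here to return the PID while the VM runs locally and an empty string otherwise; the %running table and describe() exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for PVE::QemuServer::Helpers::vm_running_locally():
# assumed to return the PID if the VM runs locally, '' otherwise.
my %running = (100 => 4242);

sub vm_running_locally {
    my ($vmid) = @_;
    return $running{$vmid} // '';
}

sub describe {
    my ($vmid) = @_;
    # Declaration and truth test in one statement; $pid is only
    # visible inside the if block.
    if (my $pid = vm_running_locally($vmid)) {
        return "VM $vmid still running as PID $pid";
    }
    return "VM $vmid not running";
}

print describe(100), "\n";
print describe(101), "\n";
```

This also keeps $pid scoped to the block that actually uses it.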
> + my $count = 0;
> + my $timeout = 300;
> +
> + print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
> + while (($count < $timeout) &&
> + PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
> + $count++;
> + sleep(1);
> + }
> +
either here
> + if ($count >= $timeout) {
> + log_warn "Reached timeout. Terminating now with SIGKILL\n";
or here, re-check that the VM is still running and still has the same PID, and
log accordingly instead of sending SIGKILL if not.
the same also applies to _do_vm_stop
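
Roughly what I have in mind, as a standalone sketch and not a drop-in hunk. vm_running_locally() is again a hypothetical stub (assumed to return the current PID or an empty string), and action_after_timeout() is a made-up helper name; the real code would call log_warn and kill directly.

```perl
use strict;
use warnings;

# Hypothetical stub for PVE::QemuServer::Helpers::vm_running_locally().
my %running;

sub vm_running_locally {
    my ($vmid) = @_;
    return $running{$vmid} // '';
}

# Called once the wait loop has hit its timeout. Returns the action to take
# ('kill' or 'none') and the message to log; the caller performs the kill.
sub action_after_timeout {
    my ($vmid, $old_pid) = @_;

    my $cur_pid = vm_running_locally($vmid);

    # The process exited between the last poll and now.
    return ('none', "QEMU process $old_pid exited on its own")
        if !$cur_pid;

    # The VM is running, but as a different process than the one we waited
    # for: killing $old_pid could hit an unrelated process.
    return ('none', "VM $vmid is now running as PID $cur_pid, not killing $old_pid")
        if $cur_pid != $old_pid;

    return ('kill', "Reached timeout. Terminating PID $old_pid with SIGKILL");
}

$running{100} = 4242;
my ($action, $msg) = action_after_timeout(100, 4242);
print "$action: $msg\n";
# In real code: kill('KILL', 4242) if $action eq 'kill';
```

This narrows the window for signalling the wrong process, it does not close it completely, since the PID could still change between the check and the kill.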
> + kill(9, $pid);
> + }
> + }
> +
> + die $err;
> }
> - };
> + }
> };
>
> if ($conf->{hugepages}) {
> --
> 2.30.2
>