[pve-devel] [PATCH qemu-server v3 3/5] await and kill lingering KVM thread when VM start reaches timeout

Daniel Tschlatscher d.tschlatscher at proxmox.com
Fri Dec 16 14:36:53 CET 2022


In some cases the VM API start method returned before the detached
KVM process had exited. This is especially problematic with HA,
because the HA manager assumes the VM started successfully, later
sees that it exited, and starts it again in an endless loop.

A second case occurs when resuming a hibernated VM. Here, the QEMU
process attempts to load the whole vmstate into memory before
exiting. Depending on vmstate size, disk read speed, and similar
factors, this can take quite a while, and the VM cannot be started
normally during this time.

To work around this, this patch intercepts the error, checks whether
a corresponding KVM process is still running, waits up to 300
seconds for it to exit, and kills it with SIGKILL if it has not,
before re-raising the error.

Signed-off-by: Daniel Tschlatscher <d.tschlatscher at proxmox.com>
---

Changes from v2:
* Rebased to current master
* Changed warn to use 'log_warn' instead
* Reworded log message when waiting for lingering qemu process
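
For reviewers, the wait-then-kill logic can be sketched in isolation as
below. This is only an illustration under assumptions, not the code in
the patch: 'wait_or_kill' is a hypothetical helper that does not exist
in qemu-server, and the liveness check is passed in as a code reference
standing in for PVE::QemuServer::Helpers::vm_running_locally(). Unlike
the hunk below, the sketch re-checks liveness after the loop instead of
comparing the counter against the timeout.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of qemu-server: poll up to $timeout times
# for a process to exit and send SIGKILL if it is still alive afterwards.
# $is_running is a code reference that returns true while the process
# exists; $interval is the pause between polls in whole seconds.
sub wait_or_kill {
    my ($pid, $timeout, $is_running, $interval) = @_;
    $interval //= 1;

    my $count = 0;
    while ($count < $timeout && $is_running->()) {
        $count++;
        sleep($interval) if $interval;
    }

    # Re-check liveness rather than relying on the counter alone, so a
    # process that exits during the final poll is not signalled.
    if ($is_running->()) {
        kill('KILL', $pid);
        return 'killed';
    }
    return 'exited';
}
```

With the values used in the patch, a call would pass a timeout of 300
and a one-second interval.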

 PVE/QemuServer.pm | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index 2adbe3a..f63dc3f 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -5884,15 +5884,41 @@ sub vm_start_nolock {
 		$tpmpid = start_swtpm($storecfg, $vmid, $tpm, $migratedfrom);
 	    }
 
-	    my $exitcode = run_command($cmd, %run_params);
-	    if ($exitcode) {
-		if ($tpmpid) {
-		    warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
-		    kill 'TERM', $tpmpid;
+	    eval {
+		my $exitcode = run_command($cmd, %run_params);
+
+		if ($exitcode) {
+		    if ($tpmpid) {
+			log_warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
+			kill 'TERM', $tpmpid;
+		    }
+		    die "QEMU exited with code $exitcode\n";
 		}
-		die "QEMU exited with code $exitcode\n";
+	    };
+
+	    if (my $err = $@) {
+		my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
+
+		    if ($pid) {
+		    my $count = 0;
+		    my $timeout = 300;
+
+		    print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
+		    while (($count < $timeout) &&
+			PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
+			$count++;
+			sleep(1);
+		    }
+
+		    if ($count >= $timeout) {
+			log_warn "Reached timeout. Terminating now with SIGKILL\n";
+			kill(9, $pid);
+		    }
+		}
+
+		die $err;
 	    }
-	};
+	}
     };
 
     if ($conf->{hugepages}) {
-- 
2.30.2





