[pve-devel] pvedaemon hanging because of qga retry

Alexandre DERUMIER aderumier at odiso.com
Thu May 17 23:16:36 CEST 2018


Hi,
I ran into a strange behaviour today,

with a running VM that has qga enabled, but the qga service down inside the guest.

After these attempts:

May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries


some API requests returned 596 errors, mainly for VM 745 (/api2/json/nodes/kvm14/qemu/745/status/current),
but also for the node kvm14 on /api2/json/nodes/kvm14/qemu


Restarting pvedaemon fixed the problem:

10.59.100.141 - root at pam [17/05/2018:21:53:51 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:21:55:00 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:01:28 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:01:30 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:02:21 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:03:05 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:03:32 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:04:40 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:05:01 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:05:59 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:06:15 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:07:50 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:09:25 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:11:00 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:12:35 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:14:19 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:15:44 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:17:19 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:18:54 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:20:29 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:22:04 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:23:39 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:25:14 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:26:49 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:28:24 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:29:59 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:31:34 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:34:44 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.18 - root at pam [17/05/2018:22:35:30 +0200] "GET /api2/json/nodes/kvm14/qemu/733/status/current HTTP/1.1" 596 -
10.59.100.141 - root at pam [17/05/2018:22:37:16 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:37:24 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:38:59 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root at pam [17/05/2018:22:40:08 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -



I don't see any error log for the fsfreeze call itself (invoked directly through the API),

but given

        } elsif ($cmd->{execute} eq 'guest-fsfreeze-freeze') {
            # freeze syncs all guest FS, if we kill it it stays in an unfreezable
            # locked state with high probability, so use an generous timeout
            $timeout = 60*60; # 1 hour


was it still running in pvedaemon?

The same happens with:
# qm agent 745 fsfreeze-freeze
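If that is what happened, the mechanism would be: the fsfreeze-freeze call blocks a pvedaemon worker for up to the one-hour timeout above, and every request queued on that worker (the status/current and qemu index calls in the log) times out with 596. A toy model of that queueing effect, purely illustrative and not actual pvedaemon code, with 0.2 s standing in for the 1 h timeout:

```python
import queue
import threading
import time

# Toy model of a single request worker serving API calls in order.
# The request names mirror the log above; the worker loop itself is
# only an illustration, not pvedaemon code.
def worker(tasks, finished):
    while True:
        name, fn = tasks.get()
        if name is None:
            return
        fn()                               # blocks until the command returns
        finished[name] = time.monotonic()  # record when the request was served

tasks = queue.Queue()
finished = {}
t = threading.Thread(target=worker, args=(tasks, finished))
t.start()

start = time.monotonic()
# A freeze that blocks the worker (0.2 s here stands in for the 1 h timeout).
tasks.put(("fsfreeze-freeze", lambda: time.sleep(0.2)))
# A status query queued behind it is only answered once the freeze returns;
# meanwhile the HTTP layer reports 596 timeouts to the client.
tasks.put(("status/current", lambda: None))
tasks.put((None, None))
t.join()
```

The status query is cheap by itself, but it cannot be served until the blocking freeze in front of it finishes.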


I think we should do a guest-agent ping with a small timeout before sending the longer commands.
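A sketch of that idea: probe the agent with a cheap guest-ping under a short timeout, and only issue the expensive command if the ping succeeds. `guarded_agent_command` and `qga_call` are hypothetical names for illustration, not existing PVE functions:

```python
# Hypothetical helper sketching the suggestion; qga_call stands in for
# whatever actually talks to the QGA socket and is NOT a real PVE API.

class AgentTimeout(Exception):
    """Raised when the guest agent does not answer in time."""

def guarded_agent_command(qga_call, cmd, long_timeout=3600, ping_timeout=5):
    # Fail fast: if a cheap guest-ping with a short timeout gets no
    # answer, don't even start the long-running command.
    qga_call("guest-ping", timeout=ping_timeout)
    return qga_call(cmd, timeout=long_timeout)

# With a responsive agent, the ping goes out first, then the real command.
calls = []
def live_agent(cmd, timeout):
    calls.append((cmd, timeout))
    return "ok"

guarded_agent_command(live_agent, "guest-fsfreeze-freeze")
```

With a dead agent the ping raises after a few seconds, so the worker is tied up for seconds instead of an hour, and the freeze is never started on a guest that could not be thawed anyway.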





