[pve-devel] pvedaemon hanging because of qga retry

Alexandre DERUMIER aderumier at odiso.com
Fri May 18 08:25:18 CEST 2018


>>If you simply skip commands like 'guest-fsfreeze-thaw' 
>>your VM will get totally unusable (frozen). So I am not 
>>sure what you want to suggest? 

I'm not sure, but don't we have 2 timeout here ?

1 for connect , and 1 for command execution ?


I would like to be able to fast timeout on connect, as if qga agent is not running, it can't connect.
and if qga is running, keep the long executing timeout  as it seem to be needed by fsfreeze-fs.



>>A correct fix would be to implement an async command queue inside qemu... 

oh, I thinked it was already async.


----- Mail original -----
De: "dietmar" <dietmar at proxmox.com>
À: "aderumier" <aderumier at odiso.com>, "pve-devel" <pve-devel at pve.proxmox.com>
Envoyé: Vendredi 18 Mai 2018 07:27:54
Objet: Re: [pve-devel] pvedaemon hanging because of qga retry

If you simply skip commands like 'guest-fsfreeze-thaw' 
your VM will get totally unusable (frozen). So I am not 
sure what you want to suggest? 

A correct fix would be to implement an async command queue inside qemu... 


> On May 18, 2018 at 7:13 AM Alexandre DERUMIER <aderumier at odiso.com> wrote: 
> 
> 
> Seem to be introduced a long time ago in 2012 
> 
> https://git.proxmox.com/?p=qemu-server.git;a=blobdiff;f=PVE/QMPClient.pm;h=9829986ae77e82d340974e4d4128741ef85b4a0e;hp=d026f4d4c3012203d96660a311b1890e84e6aa18;hb=6d04217600f2145ee80d5d62231b8ade34f2e5ff;hpb=037a97463447b06ebf79a7f1d40c596d9955acee 
> 
> previously, connect timeout was 1s. 
> 
> I think we don't have qga support at this time. Not sure why it's have been 
> increased for qmp command ? 
> 
> 
> (with 1s, it's working fine if qga agent is down). 
> 
> 
> 
> ----- Mail original ----- 
> De: "aderumier" <aderumier at odiso.com> 
> À: "pve-devel" <pve-devel at pve.proxmox.com> 
> Envoyé: Vendredi 18 Mai 2018 00:37:30 
> Objet: Re: [pve-devel] pvedaemon hanging because of qga retry 
> 
> in qmpclient : open_connection 
> 
> for (;;) { 
> $count++; 
> $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1); 
> last if $fh; 
> if ($! != EINTR && $! != EAGAIN) { 
> die "unable to connect to VM $vmid $sotype socket - $!\n"; 
> } 
> my $elapsed = tv_interval($starttime, [gettimeofday]); 
> if ($elapsed >= $timeout) { 
> die "unable to connect to VM $vmid $sotype socket - timeout after $count 
> retries\n"; 
> } 
> usleep(100000); 
> } 
> 
> 
> we use $elapsed >= $timeout. 
> 
> Isn't this timeout for command execution time and not connect time ? 
> 
> I'm seeing at the end: 
> $self->{mux}->set_timeout($fh, $timeout); 
> 
> seem to be the command execution time in the muxer 
> 
> 
> 
> 
> 
> ----- Mail original ----- 
> De: "Alexandre Derumier" <aderumier at odiso.com> 
> À: "pve-devel" <pve-devel at pve.proxmox.com> 
> Envoyé: Jeudi 17 Mai 2018 23:16:36 
> Objet: [pve-devel] pvedaemon hanging because of qga retry 
> 
> Hi, 
> I had a strange behaviour today, 
> 
> with a vm running + qga enabled, but qga service down in the vm 
> 
> after theses attempts, 
> 
> May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp 
> command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket 
> - timeout after 101 retries 
> May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp 
> command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket 
> - timeout after 101 retries 
> 
> 
> some api request give 596 errors, mainly for the 745 vm 
> (/api2/json/nodes/kvm14/qemu/745/status/current), 
> but also for the server kvm14 on /api2/json/nodes/kvm14/qemu 
> 
> 
> restarting the pvedaemon have fixed the problem 
> 
> 10.59.100.141 - root at pam [17/05/2018:21:53:51 +0200] "POST 
> /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:21:55:00 +0200] "POST 
> /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:01:28 +0200] "POST 
> /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:01:30 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:02:21 +0200] "GET 
> /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:03:05 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:03:32 +0200] "GET 
> /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:04:40 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:05:01 +0200] "GET 
> /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:05:59 +0200] "GET 
> /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:06:15 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:07:50 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:09:25 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:11:00 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:12:35 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:14:19 +0200] "GET 
> /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:15:44 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:17:19 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:18:54 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:20:29 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:22:04 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:23:39 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:25:14 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:26:49 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:28:24 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:29:59 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:31:34 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:34:44 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.18 - root at pam [17/05/2018:22:35:30 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/733/status/current HTTP/1.1" 596 - 
> 10.59.100.141 - root at pam [17/05/2018:22:37:16 +0200] "GET 
> /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:37:24 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:38:59 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 10.3.99.10 - root at pam [17/05/2018:22:40:08 +0200] "GET 
> /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 - 
> 
> 
> 
> I'm don't see errors log for fsfreeze (called directly through api), 
> 
> but 
> 
> } elsif ($cmd->{execute} eq 'guest-fsfreeze-freeze') { 
> # freeze syncs all guest FS, if we kill it it stays in an unfreezable 
> # locked state with high probability, so use an generous timeout 
> $timeout = 60*60; # 1 hour 
> 
> 
> it was still running in pvedaemon ? 
> 
> same with 
> # qm agent 745 fsfreeze-freeze 
> 
> 
> I thinked we do a quest-agent ping with a small timeout, before sending the 
> longer commands. 
> 
> 
> 
> _______________________________________________ 
> pve-devel mailing list 
> pve-devel at pve.proxmox.com 
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 
> 
> _______________________________________________ 
> pve-devel mailing list 
> pve-devel at pve.proxmox.com 
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 
> 
> _______________________________________________ 
> pve-devel mailing list 
> pve-devel at pve.proxmox.com 
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




More information about the pve-devel mailing list