[pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)

Alexandre DERUMIER aderumier at odiso.com
Tue Aug 8 19:02:21 CEST 2017


Hi,

I'm back from holiday,

I'll test the last mtunnel patches tomorrow. Thanks!


----- Original message -----
From: "aderumier" <aderumier at odiso.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Friday, July 28, 2017 17:30:48
Subject: Re: [pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)

I'm going on holiday tonight, so I won't have time to work on this, sorry.

My last test, using curl through pveproxy to resume the VM, gives me around 60-70 ms (vs. 1.3 s with ssh + qm resume).


----- Original message -----
From: "aderumier" <aderumier at odiso.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Friday, July 28, 2017 16:06:41
Subject: Re: [pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)

>>without pvesr: 
>>[1501248903.534722] 64 bytes from 10.3.95.200: icmp_seq=275 ttl=64 time=0.512 ms 
>>[1501248906.444269] 64 bytes from 10.3.95.200: icmp_seq=302 ttl=64 time=0.499 ms 

>>27 pings lost: 2.7 s


>>2017-07-28 15:42:42,998 -- VM status: paused (inmigrate) 
>>2017-07-28 15:42:43,041 -- VM status: paused 
>>.... 
>>2017-07-28 15:42:46,013 -- VM status: running 



Same test again, without pvesr, but replacing the remote "qm resume" with ssh + a simple socat:
echo -e '{ "execute": "qmp_capabilities" } \n {"execute":"human-monitor-command","arguments":{"command-line":"cont"}}' | socat - UNIX-CONNECT:/var/run/qemu-server/244.qmp 
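For reference, the same QMP exchange can be done without forking socat at all, e.g. by talking to the unix socket directly. A minimal Python sketch (the `qmp_resume` name is mine, and it issues the native QMP `cont` command instead of the human-monitor-command wrapper; it assumes the standard QMP greeting/capabilities handshake):

```python
import json
import socket

def qmp_resume(sock_path):
    """Resume a paused VM over its QMP unix socket.

    Reads the greeting banner QEMU sends on connect, negotiates
    capabilities, then issues 'cont' (the QMP equivalent of the
    human-monitor-command 'cont' used in the socat one-liner).
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    f = s.makefile("rw")
    json.loads(f.readline())  # greeting banner sent by QEMU on connect
    for cmd in ({"execute": "qmp_capabilities"}, {"execute": "cont"}):
        f.write(json.dumps(cmd) + "\n")
        f.flush()
        reply = json.loads(f.readline())
        if "error" in reply:
            raise RuntimeError(reply["error"])
    s.close()
```

Usage would be e.g. qmp_resume('/var/run/qemu-server/244.qmp'); like the socat variant, it avoids the ~1 s fork/compile overhead of a full qm invocation.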



qemu : downtime result: 55 ms 

vm-status 
---------- 
2017-07-28 16:00:10,280 -- VM status: paused (inmigrate) 
2017-07-28 16:00:10,305 -- VM status: paused 
.... 
2017-07-28 16:00:10,540 -- VM status: running 

around 130 ms (so the overhead is in phase3_cleanup)
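The gaps in these status logs can be checked mechanically; a small helper (hypothetical name, assuming the timestamp format shown in the traces above):

```python
from datetime import datetime

def status_gap_ms(a, b):
    """Millisecond gap between two 'VM status' log timestamps in the
    '%Y-%m-%d %H:%M:%S,%f' format used above ('%f' accepts the
    3-digit millisecond field after the comma)."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"
    delta = datetime.strptime(b, fmt) - datetime.strptime(a, fmt)
    return delta.total_seconds() * 1000.0

# paused -> running gap from the trace above, ~235 ms
gap = status_gap_ms("2017-07-28 16:00:10,305", "2017-07-28 16:00:10,540")
```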


ping 0.1 
--------- 
[1501250471.649694] 64 bytes from 10.3.95.200: icmp_seq=82 ttl=64 time=2.35 ms 
[1501250472.072153] 64 bytes from 10.3.95.200: icmp_seq=86 ttl=64 time=0.414 ms 

around 4 packets lost, so 400 ms (but that should be ARP)
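The downtime estimates quoted for these ping traces follow from the sequence-number gap: each missed sequence number is one send interval. A throwaway helper (hypothetical, assuming the 0.1 s interval used in these tests):

```python
def downtime_from_ping(seq_before, seq_after, interval=0.1):
    """Rough downtime estimate from a ping -D trace.

    seq_before is the icmp_seq of the last reply before the gap,
    seq_after the first reply after it; each missed sequence number
    corresponds to one send interval of silence.
    """
    return (seq_after - seq_before) * interval

# trace above: seq 82 -> 86 at 0.1 s interval, ~0.4 s of loss
downtime = downtime_from_ping(82, 86)
```

The same arithmetic reproduces the other figures in this thread, e.g. seq 275 -> 302 gives ~2.7 s and seq 79 -> 119 gives ~4 s.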




----- Original message -----
From: "aderumier" <aderumier at odiso.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Friday, July 28, 2017 15:53:39
Subject: Re: [pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)

>>in the 0.25-0.3 range here for all our commands (just for forking and 
>>printing the usage).. 

Here is a profiling of pvesr, on an old Xeon 5110 1.6 GHz:
http://odisoweb1.odiso.net/nytprof/ 


(note that I'm around 0.3 s on a recent Xeon E5 3.1 GHz vs 1.2 s)


>>well, you say that calling "qm" takes about a second on your system, and 
>>we need to call "qm resume" over SSH for the VM to be continued. so how 
>>can that happen in <100 ms? 

mmm, indeed.... 






ping -D -i 0.1 10.3.95.200 
--------------------------- 
with pvesr: 

[1501248711.774209] 64 bytes from 10.3.95.200: icmp_seq=79 ttl=64 time=0.669 ms 
[1501248716.088902] 64 bytes from 10.3.95.200: icmp_seq=119 ttl=64 time=1.17 ms 
40 pings lost: 4 s

without pvesr: 
[1501248903.534722] 64 bytes from 10.3.95.200: icmp_seq=275 ttl=64 time=0.512 ms 
[1501248906.444269] 64 bytes from 10.3.95.200: icmp_seq=302 ttl=64 time=0.499 ms 

27 pings lost: 2.7 s


2017-07-28 15:42:42,998 -- VM status: paused (inmigrate) 
2017-07-28 15:42:43,041 -- VM status: paused 
.... 
2017-07-28 15:42:46,013 -- VM status: running 


I think I did something wrong in my previous test.
These results seem to be expected, if qm resume is also slow on my old box.




----- Original message -----
From: "Fabian Grünbichler" <f.gruenbichler at proxmox.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Friday, July 28, 2017 14:55:01
Subject: Re: [pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)

On Fri, Jul 28, 2017 at 01:22:31PM +0200, Alexandre DERUMIER wrote: 
> pvesr through ssh 
> ----------------- 
> root at kvmtest1 ~ # time /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kvmtest2' root at 10.3.94.47 pvesr set-state 244 \''{}'\' 
> 
> real 0m1.399s 

I just realized SSH was probably slowed down in my test case by other
external factors, so here's your command repeated on a test cluster:

real 0m0.407s 
user 0m0.004s 
sys 0m0.000s 

> 
> 
> locally 
> -------- 
> root at kvmtest2:~# time pvesr set-state 244 {} 
> real 0m1.137s 
> 

real 0m0.268s 
user 0m0.240s 
sys 0m0.024s 


> 
> so 40 ms for ssh, and 1.137 s for pvesr itself.

see above - wonder where the difference comes from? is it possible that 
you have a big state file from testing? lots of guest configs? 

> 
> (I think we could simply skip the call if the state is empty, but reusing ssh could also help a little bit)
> 
> 
> also , a simple 
> 
> #time pvesr 
> real 0m1.098s 
> 
> (same for qm or other command) 

in the 0.25-0.3 range here for all our commands (just for forking and 
printing the usage).. 

> >>that does not make sense - are you sure you haven't removed anything 
> >>else? qemu does not know or care about pvesr, so why should it resume 
> >>automatically? 
> 
> No, it's not resumed automatically. This is the log of an external script calling qmp status in a loop,
> to see how long the VM is really paused.
> Removing pvesr in phase3 reduces the pause time (between the end of phase 2 and qm resume).
> 

well, you say that calling "qm" takes about a second on your system, and 
we need to call "qm resume" over SSH for the VM to be continued. so how 
can that happen in <100 ms? 

I cannot reproduce your results at all. the only way I can achieve 
downtime in the two digits ms range is by reverting commit 
b37ecfe6ae7f7b557db7712ee6988cb0397306e9 

observed from outside using ping -D -i 0.1:

stock PVE 5: 
[1501245897.627138] 64 bytes from 10.0.0.213: icmp_seq=70 ttl=64 time=0.273 ms 
[1501245897.731102] 64 bytes from 10.0.0.213: icmp_seq=71 ttl=64 time=0.255 ms 
[1501245897.835237] 64 bytes from 10.0.0.213: icmp_seq=72 ttl=64 time=0.352 ms 
[1501245900.955324] 64 bytes from 10.0.0.213: icmp_seq=102 ttl=64 time=0.419 ms 
[1501245901.059196] 64 bytes from 10.0.0.213: icmp_seq=103 ttl=64 time=0.298 ms 
[1501245901.163360] 64 bytes from 10.0.0.213: icmp_seq=104 ttl=64 time=0.440 ms 

no call to pvesr set-state over SSH: 
[1501245952.119454] 64 bytes from 10.0.0.213: icmp_seq=63 ttl=64 time=0.586 ms 
[1501245952.226278] 64 bytes from 10.0.0.213: icmp_seq=64 ttl=64 time=3.41 ms 
[1501245955.027289] 64 bytes from 10.0.0.213: icmp_seq=91 ttl=64 time=0.414 ms 
[1501245955.131317] 64 bytes from 10.0.0.213: icmp_seq=92 ttl=64 time=0.447 ms 

stock PVE 5 with b37ecfe6ae7f7b557db7712ee6988cb0397306e9 reverted: 
no downtime visible via ping, it's too small. 



_______________________________________________ 
pve-devel mailing list 
pve-devel at pve.proxmox.com 
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 

