[pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)
Fabian Grünbichler
f.gruenbichler at proxmox.com
Fri Jul 28 14:55:01 CEST 2017
On Fri, Jul 28, 2017 at 01:22:31PM +0200, Alexandre DERUMIER wrote:
> pvesr through ssh
> -----------------
> root@kvmtest1 ~ # time /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kvmtest2' root@10.3.94.47 pvesr set-state 244 \''{}'\'
>
> real 0m1.399s
I just realized SSH was probably slowed down in my test case by other
external factors, so here's your command repeated on a test cluster:
real 0m0.407s
user 0m0.004s
sys 0m0.000s
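(if you want to average out the jitter, repeating the call a few times
is quick - a rough sketch, reusing the vmid and target from your
example above:)

# repeat the SSH round trip ten times, printing elapsed seconds per run (GNU time)
for i in $(seq 1 10); do
    /usr/bin/time -f '%e' ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kvmtest2' \
        root@10.3.94.47 pvesr set-state 244 \''{}'\' 2>&1 | tail -n 1
done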
>
>
> locally
> --------
> root@kvmtest2:~# time pvesr set-state 244 {}
> real 0m1.137s
>
real 0m0.268s
user 0m0.240s
sys 0m0.024s
>
> so 40ms for ssh, and 1.137s for pvesr itself.
see above - I wonder where the difference comes from. is it possible
that you have a big state file from testing? or lots of guest configs?
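(to check - assuming the default locations, with the node-local
replication state in /var/lib/pve-manager/pve-replication-state.json:)

# how big did the replication state file get?
ls -lh /var/lib/pve-manager/pve-replication-state.json
# and how many guest configs does the node carry?
ls /etc/pve/qemu-server/ /etc/pve/lxc/ | wc -l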
>
> (I think we could simply skip the call if the state is empty, but reusing ssh could also help a little bit)
>
>
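regarding reusing ssh: OpenSSH connection multiplexing would drop the
per-call handshake cost - a sketch of the mechanism, not what the
migration code does today:

# first call sets up a persistent master connection
ssh -o ControlMaster=auto -o ControlPath=/run/ssh-mux-%r@%h:%p \
    -o ControlPersist=60 root@10.3.94.47 true
# later calls reuse it and skip key exchange and authentication
time ssh -o ControlPath=/run/ssh-mux-%r@%h:%p \
    root@10.3.94.47 pvesr set-state 244 \''{}'\'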
> also, a simple
>
> #time pvesr
> real 0m1.098s
>
> (same for qm or other command)
in the 0.25-0.3s range here for all of our commands (just for forking
and printing the usage).
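that is presumably mostly perl compile time, which you can split up
roughly like this (assuming the usual module layout, where
PVE::CLI::pvesr backs the pvesr binary):

time perl -e 1                     # bare interpreter startup
time perl -MPVE::CLI::pvesr -e 1   # plus compiling the module tree
time pvesr help                    # plus the actual CLI dispatch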
> >>that does not make sense - are you sure you haven't removed anything
> >>else? qemu does not know or care about pvesr, so why should it resume
> >>automatically?
>
> no, it does not resume automatically. This is the log of an external script calling qmp status in a loop
> to see how long the VM is really paused.
> removing pvesr in phase3 reduces the pause time (between the end of phase2 and qm resume).
>
well, you say that calling "qm" takes about a second on your system, and
we need to call "qm resume" over SSH for the VM to continue. so how
can that happen in <100 ms?
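(side note: to measure the pause window without qm's startup overhead
skewing the numbers, you can poll the QMP socket directly - a rough
sketch, vmid 244 assumed; the socket only accepts one client at a time,
so this can race with pvedaemon:)

# poll the run state straight from the QMP socket, ~20 times per second
while true; do
    printf '%s ' "$(date +%s.%N)"
    echo '{"execute":"qmp_capabilities"}{"execute":"query-status"}' \
        | socat - UNIX-CONNECT:/var/run/qemu-server/244.qmp \
        | grep -o '"status": "[a-z-]*"'
    sleep 0.05
done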
I cannot reproduce your results at all. the only way I can achieve
downtime in the two-digit ms range is by reverting commit
b37ecfe6ae7f7b557db7712ee6988cb0397306e9
observed from outside using ping -D -i 0.1:
stock PVE 5:
[1501245897.627138] 64 bytes from 10.0.0.213: icmp_seq=70 ttl=64 time=0.273 ms
[1501245897.731102] 64 bytes from 10.0.0.213: icmp_seq=71 ttl=64 time=0.255 ms
[1501245897.835237] 64 bytes from 10.0.0.213: icmp_seq=72 ttl=64 time=0.352 ms
[1501245900.955324] 64 bytes from 10.0.0.213: icmp_seq=102 ttl=64 time=0.419 ms
[1501245901.059196] 64 bytes from 10.0.0.213: icmp_seq=103 ttl=64 time=0.298 ms
[1501245901.163360] 64 bytes from 10.0.0.213: icmp_seq=104 ttl=64 time=0.440 ms
no call to pvesr set-state over SSH:
[1501245952.119454] 64 bytes from 10.0.0.213: icmp_seq=63 ttl=64 time=0.586 ms
[1501245952.226278] 64 bytes from 10.0.0.213: icmp_seq=64 ttl=64 time=3.41 ms
[1501245955.027289] 64 bytes from 10.0.0.213: icmp_seq=91 ttl=64 time=0.414 ms
[1501245955.131317] 64 bytes from 10.0.0.213: icmp_seq=92 ttl=64 time=0.447 ms
stock PVE 5 with b37ecfe6ae7f7b557db7712ee6988cb0397306e9 reverted:
no downtime visible via ping, it's too small.
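(resolving it would need a finer probe interval - note that intervals
below 0.2s require root:)

# 10 ms probes can resolve downtime well under 100 ms
ping -D -i 0.01 10.0.0.213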