[pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)
Fabian Grünbichler
f.gruenbichler at proxmox.com
Fri Jul 28 10:46:55 CEST 2017
On Fri, Jul 28, 2017 at 10:09:55AM +0200, Alexandre DERUMIER wrote:
>
> I have added some timer and done a migration without storage replication
>
> ->main migration loop : 150ms increase. (it's lower if I put a usleep of 1ms)
>
> 2017-07-28 10:00:10 transfer_replication_state: 1.436832
> 2017-07-28 10:00:10 move config: 0.001174
> 2017-07-28 10:00:10 switch_replication_job_target: 0.003125
> 2017-07-28 10:00:12 qm resume: 1.634583 -> (this is the time measured on the source until it gets the response; not sure how long it actually takes on the remote side)
I guess only marginally less on the target until the VM is actually
resumed.
>
> it seems to be transfer_replication_state, which calls
> my $cmd = [ @{$self->{rem_ssh}}, 'pvesr', 'set-state', $self->{vmid}, $state];
>
>
> I think calling remote qm commands takes some time to get a response.
> Note that I don't use pvesr, so I think we should bypass these commands when they are not needed.
>
yes, checking earlier on whether a replication state / job exists, and
only transferring the state and switching the job direction if actually
needed, would be an improvement for sure.
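a rough sketch of that conditional skip (not the actual PVE code; the
helper names get_local_replication_jobs, run_command and
switch_replication_job_target are hypothetical here, only the
'pvesr set-state' invocation is taken from the quoted snippet):

```perl
# hypothetical lookup: which replication jobs exist for this VM?
my $jobs = get_local_replication_jobs($self->{vmid});

if ($jobs && scalar(keys %$jobs)) {
    # only pay the remote SSH round-trip when a replication job exists
    my $cmd = [ @{$self->{rem_ssh}}, 'pvesr', 'set-state', $self->{vmid}, $state ];
    run_command($cmd);                               # hypothetical runner
    switch_replication_job_target($self->{vmid});    # hypothetical helper
} else {
    # no replication configured: skip both remote calls entirely,
    # avoiding roughly the 1.4s measured above
}
```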
I wonder whether reusing (or extending) the existing SSH tunnel for the
commands we run on the target node might reduce the overhead as well?
For cleanup in error cases, opening a new connection is probably still
advisable.
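one way to reuse a single connection would be OpenSSH connection
multiplexing; a sketch (the control-socket path and command layout are
illustrative, not what PVE actually configures):

```perl
# open one multiplexed master connection per migration and reuse it for
# every remote command, so only the first invocation pays the TCP/auth
# handshake; ControlMaster/ControlPath/ControlPersist are standard
# OpenSSH options
my $ctrl = "/run/migrate-$self->{vmid}.sock";    # illustrative socket path
my @mux  = ('-o', 'ControlMaster=auto',
            '-o', "ControlPath=$ctrl",
            '-o', 'ControlPersist=60');          # keep master open for 60s

# later commands with the same ControlPath reuse the established session
my $cmd = [ 'ssh', @mux, "root\@$target",
            'pvesr', 'set-state', $self->{vmid}, $state ];
```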
those two improvements might get us into the <1s range again, without
sacrificing consistency along the way.