[pve-devel] [v2 qemu-server 00/10] improve live-migration downtime
Fabian Grünbichler
f.gruenbichler at proxmox.com
Fri Aug 4 14:53:57 CEST 2017
this patch series attempts to reduce the downtime occurring during
live-migration of VMs to sane levels by
- conditionalizing potentially unneeded SSH connections
- replacing commands over SSH with new 'qm mtunnel' commands
- reducing the polling interval to notice a completed migration faster
attempts to monitor downtime via ping produced rather unreliable results
(probably because of ARP?), but old to old is reliably the slowest there, too.
the following are durations in 'paused' state, between 'paused inmigrate' and
'running', measured via qmp status with a 0.1s sleep in between, with each test
repeated 5 times on a network-rate-limited virtual cluster.
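for reference, the evaluation of such a measurement can be sketched roughly as
follows. this is only a minimal sketch, not the script used for the numbers
below: the sample format ([timestamp, status] pairs) and the helper name
paused_duration are hypothetical, and the actual status polling is left out.

```perl
use strict;
use warnings;

# Each sample is [timestamp_in_seconds, status_string], as a polling loop
# (one status query roughly every 0.1s) might record them.
# The sample format and this helper are assumptions for illustration only.
sub paused_duration {
    my ($samples) = @_;

    my ($paused_since, $total) = (undef, 0);
    for my $sample (@$samples) {
        my ($ts, $status) = @$sample;
        if ($status eq 'paused') {
            # remember when the paused phase was first observed
            $paused_since = $ts if !defined($paused_since);
        } elsif (defined($paused_since)) {
            # any other status ends the paused phase
            $total += $ts - $paused_since;
            $paused_since = undef;
            last if $status eq 'running';
        }
    }
    # a paused phase that never ends within the samples is not counted
    return $total;
}

my @samples = (
    [0.0, 'inmigrate'],
    [0.1, 'inmigrate'],
    [0.2, 'paused'],
    [0.3, 'paused'],
    [0.4, 'paused'],
    [0.5, 'running'],
);
printf "paused for about %.1fs\n", paused_duration(\@samples);
```

note that with this kind of sampling, a paused phase shorter than the polling
sleep can be missed entirely, in which case the helper returns 0.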
with old polling, 2G RAM (actual RAM transfer in <2s, so no auto-reduction of
polling interval happens):
old code: average 3.2s
new to old: average 1.6s (skips pvesr set-state)
new to new: average 1.2s
with old polling, 8G RAM (auto-reduction of polling interval kicks in, slightly better results):
old code: average 2.7s
new to old: 1s
new to new: 0.7s
with reduced polling interval (last patch applied), 2G and 8G RAM:
new to old: 0.4s
new to new: one single instance of logged paused state over 5 migrations!
with reduced polling interval, 8G RAM, old code but with last patch applied:
2s
so it seems like this is the right combination of changes to get downtime back
to acceptable levels without sacrificing consistency.
commands which might be integrated into mtunnel as well in the future:
- pvesr set-state
- qm nbdstop
- qm unlock
changes from v1, based on Thomas' feedback:
------8<------8<------8<------8<------8<------8<------
diff --git a/PVE/CLI/qm.pm b/PVE/CLI/qm.pm
index 1792cb0..5dce10f 100755
--- a/PVE/CLI/qm.pm
+++ b/PVE/CLI/qm.pm
@@ -273,7 +273,7 @@ __PACKAGE__->register_method ({
};
$tunnel_write->("tunnel online");
- $tunnel_write->("ver 1.0");
+ $tunnel_write->("ver 1");
while (my $line = <>) {
chomp $line;
diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
index ac9ac22..fc847cc 100644
--- a/PVE/QemuMigrate.pm
+++ b/PVE/QemuMigrate.pm
@@ -124,7 +124,7 @@ sub write_tunnel {
};
die "writing to tunnel failed: $@\n" if $@;
- if ($tunnel->{version} && $tunnel->{version} >= 1.0) {
+ if ($tunnel->{version} && $tunnel->{version} >= 1) {
my $res = eval { $self->read_tunnel($tunnel, 10); };
die "no reply to command '$command': $@\n" if $@;
@@ -156,9 +156,12 @@ sub fork_tunnel {
eval {
my $ver = $self->read_tunnel($tunnel, 10);
- $ver =~ /^ver (\d+\.\d+)$/;
- $tunnel->{version} = $1 if $1;
- $self->log('info', "ssh tunnel version: $tunnel->{version}\n");
+ if ($ver =~ /^ver (\d+)$/) {
+ $tunnel->{version} = $1;
+ $self->log('info', "ssh tunnel $ver\n");
+ } else {
+ $err = "received invalid tunnel version string '$ver'\n" if !$err;
+ }
};
if ($err) {
@@ -923,7 +926,7 @@ sub phase3_cleanup {
die "Failed to move config to node '$self->{node}' - rename failed: $!\n"
if !rename($conffile, $newconffile);
- $self->switch_replication_job_target() if $self->{replicated_volumes};;
+ $self->switch_replication_job_target() if $self->{replicated_volumes};
if ($self->{livemigration}) {
if ($self->{storage_migration}) {
@@ -943,7 +946,7 @@ sub phase3_cleanup {
}
# config moved and nbd server stopped - now we can resume vm on target
- if ($tunnel && $tunnel->{version} && $tunnel->{version} >= 1.0) {
+ if ($tunnel && $tunnel->{version} && $tunnel->{version} >= 1) {
eval {
$self->write_tunnel($tunnel, 30, "resume $vmid");
};
@@ -953,13 +956,11 @@ sub phase3_cleanup {
}
} else {
my $cmd = [@{$self->{rem_ssh}}, 'qm', 'resume', $vmid, '--skiplock', '--nocheck'];
- eval {
- my $logf = sub {
- my $line = shift;
- $self->log('err', $line);
- };
- PVE::Tools::run_command($cmd, outfunc => sub {}, errfunc => $logf);
+ my $logf = sub {
+ my $line = shift;
+ $self->log('err', $line);
};
+ eval { PVE::Tools::run_command($cmd, outfunc => sub {}, errfunc => $logf); };
if (my $err = $@) {
$self->log('err', $err);
$self->{errors} = 1;
------>8------>8------>8------>8------>8------>8------
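the effect of the stricter version parsing in the interdiff above can be
illustrated with a small stand-alone sketch. parse_tunnel_version is a
hypothetical helper for illustration and not part of the series; only the
regex and the error message are taken from the diff.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the series: returns the integer tunnel
# version, or dies on anything that is not of the form "ver N".
sub parse_tunnel_version {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^ver (\d+)$/) {
        return int($1);
    }
    die "received invalid tunnel version string '$line'\n";
}

for my $line ("ver 1", "ver 1.0", "tunnel online") {
    my $ver = eval { parse_tunnel_version($line) };
    if (defined($ver)) {
        # with version >= 1, the sender waits for a reply to each command
        my $mode = $ver >= 1 ? "replies expected" : "no replies expected";
        print "$line -> version $ver, $mode\n";
    } else {
        print "$line -> rejected: $@";
    }
}
```

with the integer-only pattern, the "ver 1.0" string from v1 of this series no
longer matches and is reported as invalid instead of being silently ignored.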
Fabian Grünbichler (10):
migrate: switch back to qm mtunnel
migrate: refactor mtunnel read/write
qm mtunnel: add tunnel version
migrate: read mtunnel version
qm mtunnel: add write helper
mtunnel: add and handle OK/ERR replies
qm mtunnel/migrate: add resume VMID command
migrate: finish tunnel in phase 3
migrate: keep track of replication
migrate: reduce polling intervals
PVE/CLI/qm.pm | 28 ++++++++++--
PVE/QemuMigrate.pm | 127 ++++++++++++++++++++++++++++++++++++++---------------
2 files changed, 117 insertions(+), 38 deletions(-)
--
2.11.0