[pve-devel] [PATCH 1/2] migrate: use ssh over socat provided UNIX socks as tunnel

Thomas Lamprecht t.lamprecht at proxmox.com
Tue May 31 11:34:47 CEST 2016


We cannot guarantee when the SSH forward tunnel really becomes
ready. The check with the mtunnel API call did not help here, as it
only verified that the SSH connection itself worked, which is of
little use: if SSH were broken it would not have connected at all.

The forward tunnel is a separate channel within the SSH connection,
independent of the `qm mtunnel` channel, so the fact that the latter
works does not guarantee that our migration tunnel is up and ready.

When the node(s) were under load, or when we did parallel
migrations (migrateall), the migrate command was often started
before the tunnel was open and ready to receive data. This led to
an immediate abort of the migration and is the main reason why
parallel migrations often leave two thirds or more of the VMs on
the source node.
The issue was tracked down to SSH: debugging the QEMU process and
enabling debug logging showed that the tunnel often became available
and ready too late, or not at all.

Fixing the forward tunnel is quirky and not straightforward; the
only possibility SSH offers is to use -N (no command), -f (go to
background) and -o "ExitOnForwardFailure=yes", in which case it
waits in the foreground until the tunnel is ready and only then
backgrounds itself. This is not a nice fit for our special use case
and our code base (see the sketch below).
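
For reference, the rejected SSH-only variant would have looked roughly
like the following sketch; $lport, $rport and $target_ip are
placeholders here, and the backgrounded ssh would still have to be
tracked down and killed later on:

    # hypothetical sketch of the rejected approach: ssh stays in the
    # foreground until the -L forward is established (or fails with
    # ExitOnForwardFailure=yes), and only then backgrounds itself
    my $cmd = ['ssh', '-o', 'ExitOnForwardFailure=yes', '-N', '-f',
               '-L', "$lport:localhost:$rport", "root\@$target_ip"];
    PVE::Tools::run_command($cmd); # returns once ssh went to background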

Waiting for the local port to become open and ready (through
/proc/net/tcp[6]), as tried in the proof of concept sketched below,
is not enough either: even when the port is in the listening state
and should theoretically accept connections, this still failed often
because the tunnel was not yet fully ready.
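
The proof of concept boiled down to something like the following
(simplified, names are illustrative); it stayed racy because a
listening port does not mean the forwarded channel is already usable:

    # simplified sketch of the /proc/net/tcp[6] polling proof of concept:
    # wait until $lport shows up in the LISTEN state (st column == 0A)
    sub wait_for_local_port {
        my ($lport, $tries) = @_;
        my $hexport = sprintf("%04X", $lport);
        for (my $i = 0; $i < $tries; $i++) {
            for my $table ('/proc/net/tcp', '/proc/net/tcp6') {
                open(my $fh, '<', $table) or next;
                while (defined(my $line = <$fh>)) {
                    # local_address is "HEXADDR:HEXPORT", then rem_address, then st
                    return 1 if $line =~ m/:\Q$hexport\E\s+[0-9A-F]+:[0-9A-F]+\s+0A\s/;
                }
                close($fh);
            }
            sleep(1);
        }
        return 0; # port never reached the LISTEN state
    }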

Furthermore, another problem would remain open if we tried to patch
the SSH forward method we currently use, a problem which we solve
for free with the approach of this patch: the method to get an
available port (next_migration_port) has a serious race condition
which can lead to the same port being used multiple times during
parallel migrations (I observed this in my many tests; it happens
seldom, but when it does it is really bad).

So let's now use socat to manage a connection over SSH. The
endpoints are UNIX sockets bound to the VMID, thus there is no port,
so no race, and also no limit on available ports (we reserved only
50 for migration).

The endpoints get created at /run/qemu-server/VMID.migrate, and as
KVM/QEMU can use a UNIX socket just as well as TCP, we do not have
to change much in the interaction with QEMU.
QEMU is started with the migrate_incoming URI pointing at the local
destination endpoint and creates the socket file; we then create a
listening socket on the source side and connect over SSH to the
destination (see the sketch below).
Now the migration can be started by issuing the migrate QMP command
with the updated URI.
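
Reduced to its core, the source side thus ends up running the
equivalent of the sketch below (the socket path and the ssh target
are examples, the real code builds the ssh part from $self->{rem_ssh}):

    # tunnel the new fork_tunnel() sets up, here for VMID 100:
    # local UNIX-LISTEN socket <-> ssh <-> remote socket QEMU listens on
    my $sock = '/run/qemu-server/100.migrate';
    my $cmd = ['socat',
        "EXEC:ssh root\@target \"socat - UNIX:$sock\"", # bridge to the destination socket
        "UNIX-LISTEN:$sock"];                           # offer the same path locally
    # the source QEMU then migrates into the local end of the tunnel:
    # vm_mon_cmd_nocheck($vmid, 'migrate', uri => "unix:$sock");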

Another small issue was that we used open2 to create the tunnel in a
child process but never collected that child when we closed the
tunnel (see the reduced sketch below).
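
The new finish_tunnel() therefore reaps that child explicitly;
stripped down, the pattern is:

    # reduced sketch of the reaping loop in the new finish_tunnel()
    use POSIX qw(:sys_wait_h);
    for (my $i = 1; $i < 20; $i++) {
        last if waitpid($cpid, WNOHANG) == $cpid; # child collected, no zombie
        kill('TERM', $cpid) if $i == 10;          # still running: ask nicely
        kill('KILL', $cpid) if $i >= 15;          # then force it
        sleep(1);
    }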

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>

Cc: Wolfgang Bumiller <w.bumiller at proxmox.com>
---
 PVE/QemuMigrate.pm | 184 ++++++++++++++++++++++-------------------------------
 PVE/QemuServer.pm  |  22 ++++---
 2 files changed, 90 insertions(+), 116 deletions(-)

diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
index a288627..865d06a 100644
--- a/PVE/QemuMigrate.pm
+++ b/PVE/QemuMigrate.pm
@@ -4,7 +4,7 @@ use strict;
 use warnings;
 use PVE::AbstractMigrate;
 use IO::File;
-use IPC::Open2;
+use POSIX qw(:sys_wait_h);
 use PVE::INotify;
 use PVE::Tools;
 use PVE::Cluster;
@@ -15,108 +15,62 @@ use PVE::RPCEnvironment;
 
 use base qw(PVE::AbstractMigrate);
 
-sub fork_command_pipe {
-    my ($self, $cmd) = @_;
-
-    my $reader = IO::File->new();
-    my $writer = IO::File->new();
-
-    my $orig_pid = $$;
-
-    my $cpid;
-
-    eval { $cpid = open2($reader, $writer, @$cmd); };
-
-    my $err = $@;
-
-    # catch exec errors
-    if ($orig_pid != $$) {
-	$self->log('err', "can't fork command pipe\n");
-	POSIX::_exit(1);
-	kill('KILL', $$);
+sub fork_tunnel {
+    my ($self, $raddr) = @_;
+
+    my $ssh_cmd = PVE::Tools::cmd2string([@{$self->{rem_ssh}}]);
+    my $cmd = ['socat', 'EXEC:'. $ssh_cmd .' "socat - UNIX:' . $raddr .'"', 'UNIX-LISTEN:' . $raddr];
+
+    my $tunnel = {};
+
+    my $pid = fork();
+    if (!defined($pid)) {
+	$self->log('err', "forking tunnel failed");
+	return undef;
+    } elsif ($pid == 0) {
+	exec(@$cmd);
+	exit(-1);
+    } else {
+	$tunnel->{cpid} = $pid;
     }
 
-    die $err if $err;
+    $tunnel->{raddr} = $raddr;
 
-    return { writer => $writer, reader => $reader, pid => $cpid };
+    return $tunnel;
 }
 
-sub finish_command_pipe {
-    my ($self, $cmdpipe, $timeout) = @_;
+sub finish_tunnel {
+    my ($self) = @_;
 
-    my $writer = $cmdpipe->{writer};
-    my $reader = $cmdpipe->{reader};
+    my $tunnel = $self->{tunnel};
+    my $cpid = $tunnel->{cpid};
+    return undef if !$cpid;
 
-    $writer->close();
-    $reader->close();
+    # collect child process
+    for (my $i = 1; $i < 20; $i++) {
+	my $waitpid = waitpid($cpid, WNOHANG);
+	last if (defined($waitpid) && ($waitpid == $cpid));
 
-    my $cpid = $cmdpipe->{pid};
+	if ($i == 10) {
+	    $self->log('info', "ssh tunnel still running - terminating now with SIGTERM");
+	    kill(15, $cpid);
 
-    if ($timeout) {
-	for (my $i = 0; $i < $timeout; $i++) {
-	    return if !PVE::ProcFSTools::check_process_running($cpid);
-	    sleep(1);
+	} elsif ($i >= 15) {
+	    $self->log('info', "ssh tunnel still running - terminating now with SIGKILL");
+	    kill(9, $cpid);
 	}
+	sleep (1);
     }
 
-    $self->log('info', "ssh tunnel still running - terminating now with SIGTERM\n");
-    kill(15, $cpid);
+    delete $tunnel->{cpid};
 
-    # wait again
-    for (my $i = 0; $i < 10; $i++) {
-	return if !PVE::ProcFSTools::check_process_running($cpid);
-	sleep(1);
-    }
+    # just to be sure, remove the socket file on the local side..
+    my $cmd = ['rm', '-f', $tunnel->{raddr}];
+    PVE::Tools::run_command($cmd);
 
-    $self->log('info', "ssh tunnel still running - terminating now with SIGKILL\n");
-    kill 9, $cpid;
-    sleep 1;
-}
-
-sub fork_tunnel {
-    my ($self, $nodeip, $lport, $rport) = @_;
-
-    my @localtunnelinfo = $lport ? ('-L' , "$lport:localhost:$rport" ) : ();
-
-    my $cmd = [@{$self->{rem_ssh}}, @localtunnelinfo, 'qm', 'mtunnel' ];
-
-    my $tunnel = $self->fork_command_pipe($cmd);
-
-    my $reader = $tunnel->{reader};
-
-    my $helo;
-    eval {
-	PVE::Tools::run_with_timeout(60, sub { $helo = <$reader>; });
-	die "no reply\n" if !$helo;
-	die "no quorum on target node\n" if $helo =~ m/^no quorum$/;
-	die "got strange reply from mtunnel ('$helo')\n"
-	    if $helo !~ m/^tunnel online$/;
-    };
-    my $err = $@;
-
-    if ($err) {
-	$self->finish_command_pipe($tunnel);
-	die "can't open migration tunnel - $err";
-    }
-    return $tunnel;
-}
-
-sub finish_tunnel {
-    my ($self, $tunnel) = @_;
-
-    my $writer = $tunnel->{writer};
-
-    eval {
-	PVE::Tools::run_with_timeout(30, sub {
-	    print $writer "quit\n";
-	    $writer->flush();
-	});
-    };
-    my $err = $@;
-
-    $self->finish_command_pipe($tunnel, 30);
-
-    die $err if $err;
+    # .. and on the remote side, so the socket path is free for reuse
+    unshift @{$cmd}, @{$self->{rem_ssh}};
+    PVE::Tools::run_command($cmd);
 }
 
 sub lock_vm {
@@ -330,6 +284,7 @@ sub phase2 {
 
     my $raddr;
     my $rport;
+    my $ruri; # the whole migration dst. URL (protocol:/address[:port])
     my $nodename = PVE::INotify::nodename();
 
     ## start on remote node
@@ -353,14 +308,22 @@ sub phase2 {
     # instead we pipe it through STDIN
     PVE::Tools::run_command($cmd, input => $spice_ticket, outfunc => sub {
 	my $line = shift;
+	$self->log('info', $line);
 
 	if ($line =~ m/^migration listens on tcp:(localhost|[\d\.]+|\[[\d\.:a-fA-F]+\]):(\d+)$/) {
 	    $raddr = $1;
 	    $rport = int($2);
+	    $ruri = "tcp:$raddr:$rport";
+	}
+	elsif ($line =~ m!^migration listens on unix:(/run/qemu-server/(\d+)\.migrate)$!) {
+	    $raddr = $1;
+	    die "Destination UNIX socket's VMID does not match source VMID" if $vmid ne $2;
+	    $ruri = "unix:$raddr";
 	}
 	elsif ($line =~ m/^migration listens on port (\d+)$/) {
 	    $raddr = "localhost";
 	    $rport = int($1);
+	    $ruri = "tcp:$raddr:$rport";
 	}
         elsif ($line =~ m/^spice listens on port (\d+)$/) {
 	    $spice_port = int($1);
@@ -372,14 +335,26 @@ sub phase2 {
 
     die "unable to detect remote migration address\n" if !$raddr;
 
-    ## create tunnel to remote port
-    $self->log('info', "starting ssh migration tunnel");
-    my $pfamily = PVE::Tools::get_host_address_family($nodename);
-    my $lport = ($raddr eq "localhost") ? PVE::Tools::next_migrate_port($pfamily) : undef;
-    $self->{tunnel} = $self->fork_tunnel($self->{nodeip}, $lport, $rport);
+    if ($ruri =~ /^unix:/) {
+	## create tunnel to remote port
+	$self->log('info', "start remote tunnel");
+	$self->{tunnel} = $self->fork_tunnel($raddr);
+
+	my $unix_socket_try = 0; # wait for the socket to become ready
+	while (! -S "/run/qemu-server/$vmid.migrate") {
+	    $unix_socket_try++;
+	    if ($unix_socket_try > 100) {
+		$self->{errors} = 1;
+		$self->finish_tunnel();
+		die "Timeout, migration socket $ruri did not become ready";
+	    }
+
+	    usleep(10000);
+	}
+    }
 
     my $start = time();
-    $self->log('info', "starting online/live migration on $raddr:$rport");
+    $self->log('info', "starting online/live migration on $ruri");
     $self->{livemigration} = 1;
 
     # load_defaults
@@ -438,10 +413,10 @@ sub phase2 {
     }
 
     eval {
-        PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate", uri => "tcp:$raddr:$rport");
+        PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate", uri => $ruri);
     };
     my $merr = $@;
-    $self->log('info', "migrate uri => tcp:$raddr:$rport failed: $merr") if $merr;
+    $self->log('info', "migrate uri => $ruri failed: $merr") if $merr;
 
     my $lstat = 0;
     my $usleep = 2000000;
@@ -538,13 +513,10 @@ sub phase2 {
 	    die "unable to parse migration status '$stat->{status}' - aborting\n";
 	}
     }
-    #to be sure tat the tunnel is closed 
+
+    # just to be sure that the tunnel always gets closed
     if ($self->{tunnel}) {
-	eval { finish_tunnel($self, $self->{tunnel});  };
-	if (my $err = $@) {
-	    $self->log('err', $err);
-	    $self->{errors} = 1;
-	}
+	finish_tunnel($self);
     }
 }
 
@@ -580,11 +552,7 @@ sub phase2_cleanup {
     }
 
     if ($self->{tunnel}) {
-	eval { finish_tunnel($self, $self->{tunnel});  };
-	if (my $err = $@) {
-	    $self->log('err', $err);
-	    $self->{errors} = 1;
-	}
+	finish_tunnel($self);
     }
 }
 
diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index 33779b3..dd94de4 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -4302,21 +4302,27 @@ sub vm_start {
 
 	my ($cmd, $vollist, $spice_port) = config_to_command($storecfg, $vmid, $conf, $defaults, $forcemachine);
 
-	my $migrate_port = 0;
 	my $migrate_uri;
 	if ($statefile) {
 	    if ($statefile eq 'tcp') {
-		my $localip = "localhost";
+		# default to secure migrations: use ssh over a socat managed UNIX socket
+		# pair, as an ssh forward tunnel is not reliably ready in a deterministic way
+		$migrate_uri = "unix:/run/qemu-server/$vmid.migrate";
+
 		my $datacenterconf = PVE::Cluster::cfs_read_file('datacenter.cfg');
-		my $nodename = PVE::INotify::nodename();
 		if ($datacenterconf->{migration_unsecure}) {
-			$localip = PVE::Cluster::remote_node_ip($nodename, 1);
-			$localip = "[$localip]" if Net::IP::ip_is_ipv6($localip);
+		    my $nodename = PVE::INotify::nodename();
+		    my $localip = PVE::Cluster::remote_node_ip($nodename, 1);
+		    $localip = "[$localip]" if Net::IP::ip_is_ipv6($localip);
+
+		    my $pfamily = PVE::Tools::get_host_address_family($nodename);
+		    my $migrate_port = PVE::Tools::next_migrate_port($pfamily) || 0;
+
+		    $migrate_uri = "tcp:${localip}:${migrate_port}";
 		}
-		my $pfamily = PVE::Tools::get_host_address_family($nodename);
-		$migrate_port = PVE::Tools::next_migrate_port($pfamily);
-		$migrate_uri = "tcp:${localip}:${migrate_port}";
+
 		push @$cmd, '-incoming', $migrate_uri;
+
 		push @$cmd, '-S';
 	    } else {
 		push @$cmd, '-loadstate', $statefile;
-- 
2.1.4
