[pve-devel] [PATCH 1/2] migrate: use ssh over socat provided UNIX socks as tunnel
Thomas Lamprecht
t.lamprecht at proxmox.com
Tue May 31 11:38:43 CEST 2016
On 05/31/2016 11:34 AM, Thomas Lamprecht wrote:
> We cannot guarantee when the SSH forward tunnel really becomes
> ready. The check with the mtunnel API call did not help here, as it
> only verified that the SSH connection itself worked; that tells us
> nothing, because otherwise SSH would not have connected at all.
>
> The forward tunnel is a separate channel within the SSH connection,
> independent of the SSH `qm mtunnel` channel, so even when the
> latter works this does not guarantee that our migration tunnel is
> up and ready.
>
> When the node(s) were under load, or when we did parallel
> migrations (migrateall), the migrate command was often started
> before the tunnel was open and ready to receive data. This led to
> an immediate abort of the migration and is the main reason why
> parallel migrations often leave two thirds or more of the VMs on
> the source node.
> The issue was tracked down to SSH: debugging the QEMU process
> and enabling debug logging showed that the tunnel often became
> available and ready too late, or not at all.
>
> Fixing the forward tunnel is quirky and not straightforward; the
> only possibility SSH offers is to use -N (no command), -f
> (background) and -o "ExitOnForwardFailure=yes", so that it waits in
> the foreground until the tunnel is ready and only then backgrounds
> itself. This does not fit our special use case and our code base
> particularly well.
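>
> For illustration, that rejected approach would have looked roughly
> like this (a sketch only, not part of this patch; $lport/$rport and
> $self->{rem_ssh} as used by the old fork_tunnel code):
>
>     # ssh stays in the foreground until the -L forward is usable and
>     # only then backgrounds itself (-f); ExitOnForwardFailure makes
>     # it fail hard if the forward cannot be set up at all
>     my $cmd = [@{$self->{rem_ssh}}, '-o', 'ExitOnForwardFailure=yes',
>                '-N', '-f', '-L', "$lport:localhost:$rport"];
>     PVE::Tools::run_command($cmd);
>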
> Waiting for the local port to become open and ready (through
> /proc/net/tcp[6]) as a proof of concept is not enough either: even
> when the port is in the listening state and should theoretically
> accept connections, this still failed often because the tunnel was
> not yet fully ready.
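>
> Such a check boils down to scanning the kernel tables for a socket
> in the LISTEN state, roughly like this (port_is_listening is a
> hypothetical helper for illustration only, not part of this patch):
>
>     # ports in /proc/net/tcp[6] are hex; socket state 0A means LISTEN
>     sub port_is_listening {
>         my ($port) = @_;
>         my $hexport = sprintf("%04X", $port);
>         for my $table ('/proc/net/tcp', '/proc/net/tcp6') {
>             open(my $fh, '<', $table) or next;
>             while (defined(my $line = <$fh>)) {
>                 return 1 if $line =~
>                     m/^\s*\d+:\s+[0-9A-F]+:\Q$hexport\E\s+[0-9A-F]+:[0-9A-F]+\s+0A\s/;
>             }
>         }
>         return 0;
>     }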
>
> Further, another problem would remain even if we patched the SSH
> forward method we currently use - a problem which the approach of
> this patch solves for free - namely that the method used to get an
> available port (next_migrate_port) has a serious race condition
> which can lead to the same port being used more than once during
> parallel migration (I observed this in my many tests; seldom, but
> if it happens it is really bad).
>
> So let's now use socat to manage a connection over SSH. The
> endpoints are UNIX sockets bound to the VMID - thus no ports, so no
> race, and also no limit on the number of available ports (we
> reserved 50 for migration).
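>
> Spelled out, the tunnel fork_tunnel() sets up amounts to the
> following ($ssh_cmd is the stringified $self->{rem_ssh} command and
> $raddr the socket path reported by the remote 'qm start'):
>
>     # the local socat listens on the UNIX socket the source QEMU will
>     # connect to and pipes the data through ssh to a second socat on
>     # the target, which connects to the identically named socket the
>     # destination QEMU listens on
>     my $cmd = ['socat',
>         'EXEC:' . $ssh_cmd . ' "socat - UNIX:' . $raddr . '"',
>         'UNIX-LISTEN:' . $raddr];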
>
> The endpoints get created in /run/qemu-server/VMID.migrate, and as
> KVM/QEMU can use UNIX sockets just as well as TCP we do not have to
> change much in the interaction with QEMU.
> QEMU is started with the migrate_incoming URI pointing at the local
> destination endpoint and creates the socket file there; we then
> create a listening socket on the source side and connect over SSH
> to the destination.
> Now the migration can be started by issuing the migrate qmp command
> with an updated URI.
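>
> On the code side the only thing that changes for QEMU is therefore
> the URI, e.g. (mirroring what phase2 below does):
>
>     # destination: 'qm start' passes
>     #     -incoming unix:/run/qemu-server/<vmid>.migrate
>     # source: once the tunnel socket exists, kick off the migration
>     PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate",
>         uri => "unix:/run/qemu-server/$vmid.migrate");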
>
> Another small issue was that we used open2 to create the tunnel in
> a child process but never collected that child when we closed the
> tunnel.
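>
> Collecting the child is now done explicitly in finish_tunnel(); in
> essence it is a non-blocking waitpid() loop, simplified here
> (reap_tunnel_child is just an illustrative name):
>
>     use POSIX qw(:sys_wait_h);
>
>     # poll for the exited tunnel child instead of leaving a zombie;
>     # the real finish_tunnel() additionally escalates to SIGTERM/SIGKILL
>     sub reap_tunnel_child {
>         my ($cpid) = @_;
>         for (1 .. 20) {
>             return 1 if waitpid($cpid, WNOHANG) == $cpid;
>             sleep(1);
>         }
>         return 0;
>     }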
>
> Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
>
> Cc: Wolfgang Bumiller <w.bumiller at proxmox.com>
> ---
> PVE/QemuMigrate.pm | 184 ++++++++++++++++++++++-------------------------------
> PVE/QemuServer.pm | 22 ++++---
> 2 files changed, 90 insertions(+), 116 deletions(-)
>
> diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
> index a288627..865d06a 100644
> --- a/PVE/QemuMigrate.pm
> +++ b/PVE/QemuMigrate.pm
> @@ -4,7 +4,7 @@ use strict;
> use warnings;
> use PVE::AbstractMigrate;
> use IO::File;
> -use IPC::Open2;
> +use POSIX qw(:sys_wait_h);
> use PVE::INotify;
> use PVE::Tools;
> use PVE::Cluster;
> @@ -15,108 +15,62 @@ use PVE::RPCEnvironment;
>
> use base qw(PVE::AbstractMigrate);
>
> -sub fork_command_pipe {
> - my ($self, $cmd) = @_;
> -
> - my $reader = IO::File->new();
> - my $writer = IO::File->new();
> -
> - my $orig_pid = $$;
> -
> - my $cpid;
> -
> - eval { $cpid = open2($reader, $writer, @$cmd); };
> -
> - my $err = $@;
> -
> - # catch exec errors
> - if ($orig_pid != $$) {
> - $self->log('err', "can't fork command pipe\n");
> - POSIX::_exit(1);
> - kill('KILL', $$);
> +sub fork_tunnel {
> + my ($self, $raddr) = @_;
> +
> + my $ssh_cmd = PVE::Tools::cmd2string([@{$self->{rem_ssh}}]);
> + my $cmd = ['socat', 'EXEC:'. $ssh_cmd .' "socat - UNIX:' . $raddr .'"', 'UNIX-LISTEN:' . $raddr];
> +
> + my $tunnel = {};
> +
> + my $pid = fork();
> + if (!defined($pid)) {
> + $self->log('err', "forking tunnel failed");
> + return undef;
> + } elsif ($pid == 0) {
> + exec(@$cmd);
> + exit(-1);
> + } else {
> + $tunnel->{cpid} = $pid;
> }
>
> - die $err if $err;
> + $tunnel->{raddr} = $raddr;
>
> - return { writer => $writer, reader => $reader, pid => $cpid };
> + return $tunnel;
> }
>
> -sub finish_command_pipe {
> - my ($self, $cmdpipe, $timeout) = @_;
> +sub finish_tunnel {
> + my ($self) = @_;
>
> - my $writer = $cmdpipe->{writer};
> - my $reader = $cmdpipe->{reader};
> + my $tunnel = $self->{tunnel};
> + my $cpid = $tunnel->{cpid};
> + return undef if !$cpid;
>
> - $writer->close();
> - $reader->close();
> + # collect child process
> + for (my $i = 1; $i < 20; $i++) {
> + my $waitpid = waitpid($cpid, WNOHANG);
> + last if (defined($waitpid) && ($waitpid == $cpid));
>
> - my $cpid = $cmdpipe->{pid};
> + if ($i == 10) {
> + $self->log('info', "ssh tunnel still running - terminating now with SIGTERM");
> + kill(15, $cpid);
>
> - if ($timeout) {
> - for (my $i = 0; $i < $timeout; $i++) {
> - return if !PVE::ProcFSTools::check_process_running($cpid);
> - sleep(1);
> + } elsif ($i >= 15) {
> + $self->log('info', "ssh tunnel still running - terminating now with SIGKILL");
> + kill(9, $cpid);
> }
> + sleep (1);
> }
>
> - $self->log('info', "ssh tunnel still running - terminating now with SIGTERM\n");
> - kill(15, $cpid);
> + delete $tunnel->{cpid};
>
> - # wait again
> - for (my $i = 0; $i < 10; $i++) {
> - return if !PVE::ProcFSTools::check_process_running($cpid);
> - sleep(1);
> - }
> + # just to be sure, check on local..
> + my $cmd = ['rm', '-f', $tunnel->{raddr}]; #
> + PVE::Tools::run_command($cmd);
>
> - $self->log('info', "ssh tunnel still running - terminating now with SIGKILL\n");
> - kill 9, $cpid;
> - sleep 1;
> -}
> -
> -sub fork_tunnel {
> - my ($self, $nodeip, $lport, $rport) = @_;
> -
> - my @localtunnelinfo = $lport ? ('-L' , "$lport:localhost:$rport" ) : ();
> -
> - my $cmd = [@{$self->{rem_ssh}}, @localtunnelinfo, 'qm', 'mtunnel' ];
> -
> - my $tunnel = $self->fork_command_pipe($cmd);
> -
> - my $reader = $tunnel->{reader};
> -
> - my $helo;
> - eval {
> - PVE::Tools::run_with_timeout(60, sub { $helo = <$reader>; });
> - die "no reply\n" if !$helo;
> - die "no quorum on target node\n" if $helo =~ m/^no quorum$/;
> - die "got strange reply from mtunnel ('$helo')\n"
> - if $helo !~ m/^tunnel online$/;
> - };
> - my $err = $@;
> -
> - if ($err) {
> - $self->finish_command_pipe($tunnel);
> - die "can't open migration tunnel - $err";
> - }
> - return $tunnel;
> -}
> -
> -sub finish_tunnel {
> - my ($self, $tunnel) = @_;
> -
> - my $writer = $tunnel->{writer};
> -
> - eval {
> - PVE::Tools::run_with_timeout(30, sub {
> - print $writer "quit\n";
> - $writer->flush();
> - });
> - };
> - my $err = $@;
> -
> - $self->finish_command_pipe($tunnel, 30);
> -
> - die $err if $err;
> + # .. and remote side that socket disappeared and is ready to reuse
> + unshift @{$cmd}, @{$self->{rem_ssh}};
> + PVE::Tools::run_command($cmd);
> }
>
> sub lock_vm {
> @@ -330,6 +284,7 @@ sub phase2 {
>
> my $raddr;
> my $rport;
> + my $ruri; # the whole migration dst. URL (protocol:/address[:port])
> my $nodename = PVE::INotify::nodename();
>
> ## start on remote node
> @@ -353,14 +308,22 @@ sub phase2 {
> # instead we pipe it through STDIN
> PVE::Tools::run_command($cmd, input => $spice_ticket, outfunc => sub {
> my $line = shift;
> + $self->log('info', $line);
>
> if ($line =~ m/^migration listens on tcp:(localhost|[\d\.]+|\[[\d\.:a-fA-F]+\]):(\d+)$/) {
> $raddr = $1;
> $rport = int($2);
> + $ruri = "tcp:$raddr:$rport";
> + }
> + elsif ($line =~ m!^migration listens on unix:(/run/qemu-server/(\d+)\.migrate)$!) {
> + $raddr = $1;
> + die "Destination UNIX socket's VMID does not match source VMID" if $vmid ne $2;
> + $ruri = "unix:$raddr";
> }
> elsif ($line =~ m/^migration listens on port (\d+)$/) {
> $raddr = "localhost";
> $rport = int($1);
> + $ruri = "tcp:$raddr:$rport";
> }
> elsif ($line =~ m/^spice listens on port (\d+)$/) {
> $spice_port = int($1);
> @@ -372,14 +335,26 @@ sub phase2 {
>
> die "unable to detect remote migration address\n" if !$raddr;
>
> - ## create tunnel to remote port
> - $self->log('info', "starting ssh migration tunnel");
> - my $pfamily = PVE::Tools::get_host_address_family($nodename);
> - my $lport = ($raddr eq "localhost") ? PVE::Tools::next_migrate_port($pfamily) : undef;
> - $self->{tunnel} = $self->fork_tunnel($self->{nodeip}, $lport, $rport);
> + if ($ruri =~ /^unix:/) {
> + ## create tunnel to remote port
> + $self->log('info', "start remote tunnel");
> + $self->{tunnel} = $self->fork_tunnel($raddr);
> +
> + my $unix_socket_try = 0; # wait for the socket to become ready
> + while (! -S "/run/qemu-server/$vmid.migrate") {
> + $unix_socket_try++;
> + if ($unix_socket_try > 100) {
> + $self->{errors} = 1;
> + $self->finish_tunnel();
> + die "Timeout, migration socket $ruri did not get ready";
> + }
> +
> + usleep(10000);
> + }
> + }
>
> my $start = time();
> - $self->log('info', "starting online/live migration on $raddr:$rport");
> + $self->log('info', "starting online/live migration on $ruri");
> $self->{livemigration} = 1;
>
> # load_defaults
> @@ -438,10 +413,10 @@ sub phase2 {
> }
>
> eval {
> - PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate", uri => "tcp:$raddr:$rport");
> + PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate", uri => $ruri);
> };
> my $merr = $@;
> - $self->log('info', "migrate uri => tcp:$raddr:$rport failed: $merr") if $merr;
> + $self->log('info', "migrate uri => $ruri failed: $merr") if $merr;
>
> my $lstat = 0;
> my $usleep = 2000000;
> @@ -538,13 +513,10 @@ sub phase2 {
> die "unable to parse migration status '$stat->{status}' - aborting\n";
> }
> }
> - #to be sure tat the tunnel is closed
> +
> + # just to be sure that the tunnel always gets closed
> if ($self->{tunnel}) {
> - eval { finish_tunnel($self, $self->{tunnel}); };
> - if (my $err = $@) {
> - $self->log('err', $err);
> - $self->{errors} = 1;
> - }
> + finish_tunnel($self);
> }
> }
>
> @@ -580,11 +552,7 @@ sub phase2_cleanup {
> }
>
> if ($self->{tunnel}) {
> - eval { finish_tunnel($self, $self->{tunnel}); };
> - if (my $err = $@) {
> - $self->log('err', $err);
> - $self->{errors} = 1;
> - }
> + finish_tunnel($self});
Ugh, I sent the wrong file, which still has the problematic "}" syntax
error here; sorry, I'll resend
(and I need to clean up my format-patch output more strictly... :/).
> }
> }
>
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 33779b3..dd94de4 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -4302,21 +4302,27 @@ sub vm_start {
>
> my ($cmd, $vollist, $spice_port) = config_to_command($storecfg, $vmid, $conf, $defaults, $forcemachine);
>
> - my $migrate_port = 0;
> my $migrate_uri;
> if ($statefile) {
> if ($statefile eq 'tcp') {
> - my $localip = "localhost";
> + # default to secure migrations: use ssh over a socat managed UNIX socket
> + # pair, as an ssh forward tunnel is not deterministically and reliably ready
> + $migrate_uri = "unix:/run/qemu-server/$vmid.migrate";
> +
> my $datacenterconf = PVE::Cluster::cfs_read_file('datacenter.cfg');
> - my $nodename = PVE::INotify::nodename();
> if ($datacenterconf->{migration_unsecure}) {
> - $localip = PVE::Cluster::remote_node_ip($nodename, 1);
> - $localip = "[$localip]" if Net::IP::ip_is_ipv6($localip);
> + my $nodename = PVE::INotify::nodename();
> + my $localip = PVE::Cluster::remote_node_ip($nodename, 1);
> + $localip = "[$localip]" if Net::IP::ip_is_ipv6($localip);
> +
> + my $pfamily = PVE::Tools::get_host_address_family($nodename);
> + my $migrate_port = PVE::Tools::next_migrate_port($pfamily) || 0;
> +
> + $migrate_uri = "tcp:${localip}:${migrate_port}";
> }
> - my $pfamily = PVE::Tools::get_host_address_family($nodename);
> - $migrate_port = PVE::Tools::next_migrate_port($pfamily);
> - $migrate_uri = "tcp:${localip}:${migrate_port}";
> +
> push @$cmd, '-incoming', $migrate_uri;
> +
> push @$cmd, '-S';
> } else {
> push @$cmd, '-loadstate', $statefile;