[pve-devel] qemu ha migration : race between move file and resume vm

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Oct 14 08:03:02 CEST 2015



On 10/14/2015 07:40 AM, Alexandre DERUMIER wrote:
> Hi,
> Two users have reported a migration problem when HA is enabled:
> http://forum.proxmox.com/threads/23848-PVE-4-KVM-live-migration-problem
>
> I'm also able to reproduce it.
>
> task log
> ---------
> task started by HA resource agent
> Oct 14 07:27:48 starting migration of VM 125 to node 'kvmtest2' (10.3.94.47)
> Oct 14 07:27:48 copying disk images
> Oct 14 07:27:48 starting VM 125 on remote node 'kvmtest2'
> Oct 14 07:27:49 starting ssh migration tunnel
> Oct 14 07:27:51 starting online/live migration on 10.3.94.47:60000
> Oct 14 07:27:51 migrate_set_speed: 8589934592
> Oct 14 07:27:51 migrate_set_downtime: 0.1
> Oct 14 07:27:53 migration speed: 64.00 MB/s - downtime 7 ms
> Oct 14 07:27:53 migration status: completed
> Oct 14 07:27:54 ERROR: unable to find configuration file for VM 125 - no such machine
> Oct 14 07:27:54 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@10.3.94.47 qm resume 125 --skiplock' failed: exit code 2
> Oct 14 07:27:57 ERROR: migration finished with problems (duration 00:00:09)
> TASK ERROR: migration problems
>
>
>
> The problem is in QemuMigrate.pm,
> in phase3 cleanup
>
>
>      die "Failed to move config to node '$self->{node}' - rename failed: $!\n"
>          if !rename($conffile, $newconffile);
>
>      if ($self->{livemigration}) {
>          # now that the config file has been moved, we can resume the VM on the target (live migration only)
>          my $cmd = [@{$self->{rem_ssh}}, 'qm', 'resume', $vmid, '--skiplock'];
>          eval{ PVE::Tools::run_command($cmd, outfunc => sub {},
>                  errfunc => sub {
>                      my $line = shift;
>                      $self->log('err', $line);
>                  });
>          };
>          if (my $err = $@) {
>              $self->log('err', $err);
>              $self->{errors} = 1;
>          }
>      }
>
>
>
> The file move is done on the source node,
> but the target node doesn't see the moved file for around 3 s, so the resume fails.
>
>
> I don't know how HA is related here. Maybe some kind of file lock?
No, HA does not lock the config file. It more or less makes an API call
to PVE::API2::Qemu->migrate_vm, like:

>     my $upid = PVE::API2::Qemu->migrate_vm($params);
>     $haenv->upid_wait($upid);
with the params:
>     my $params = {
>         node => $nodename,
>         vmid => $vmid,
>         target => $target,
>         online => 1,
>     };
This happens in a forked process, which then waits for the task to
complete.

The HA manager moves the config only when the VM is offline and gets a
migrate command, which shouldn't be the case here :)
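
To make the reported symptom concrete (the renamed config only becomes
visible on the target after a few seconds, so 'qm resume' fails), here
is a minimal Perl sketch of polling for a file before acting on it. It
is only an illustration under assumptions, not a proposed patch and not
code from QemuMigrate.pm: wait_for_file is a hypothetical helper, and
the timeout and polling interval are arbitrary example values.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper (not part of PVE): poll until $path exists or
# $timeout seconds have elapsed. $interval is the pause between checks.
# Returns 1 if the file became visible in time, 0 otherwise.
sub wait_for_file {
    my ($path, $timeout, $interval) = @_;
    $interval //= 0.2;

    my $deadline = time() + $timeout;
    while (1) {
        return 1 if -e $path;            # file is visible, done
        return 0 if time() >= $deadline; # gave up waiting
        sleep($interval);
    }
}

# Illustrative use only; the path is an assumed example, not taken
# from the task log above:
# my $ok = wait_for_file("/etc/pve/nodes/kvmtest2/qemu-server/125.conf", 5);
# warn "config still not visible on target\n" if !$ok;
```

Polling like this would only hide the delay. It says nothing about why
the target sees the rename late when the migration is started by HA.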
