[PVE-User] HA migration behaviour vs. failures
Dhaussy Alexandre
ADhaussy at voyages-sncf.com
Thu Jul 24 11:16:45 CEST 2014
@Proxmox devs:
Sorry for being a bit cheeky here, but why did you change the way the
RedHat cluster stack behaves on online migration? (There, a migration
failure is a non-critical error.)
I guess there must be a good reason, but I'm still interested to know
why. :$
Regards,
--
Alexandre DHAUSSY
On 22/07/2014 18:30, Dhaussy Alexandre wrote:
> Greetings,
>
> I've been "playing" with the latest version of Proxmox (3-node cluster + GlusterFS) for a couple of months.
> My goal is to replace 3 RedHat 5 KVM servers (no HA) hosting ~100 VMs on NAS storage.
>
> But I have some annoying issues with live migrations..
> Sometimes it works, but sometimes, for no apparent reason, it doesn't.
> When it fails (migration aborted), I try again and then it works! :(
>
> Jul 11 14:48:49 starting ssh migration tunnel
> Jul 11 14:48:50 starting online/live migration on localhost:60000
> Jul 11 14:48:50 migrate_set_speed: 8589934592
> Jul 11 14:48:50 migrate_set_downtime: 0.1
> Jul 11 14:48:52 ERROR: online migrate failure - aborting
> Jul 11 14:48:52 aborting phase 2 - cleanup resources
> Jul 11 14:48:52 migrate_cancel
> Jul 11 14:58:52 ERROR: migration finished with problems (duration 00:10:05)
> TASK ERROR: migration problems
>
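> (A side note for anyone hitting the same thing: the task log above never shows the underlying QEMU error. One rough way to dig deeper on the source node, assuming the PVE 3.x Perl API, is to ask the VM's QEMU monitor directly right after a failed attempt. The script below is only a debugging sketch, nothing official:)
>
> #!/usr/bin/perl
> # Debugging sketch: dump QEMU's own view of the last migration attempt.
> # Assumes PVE 3.x, where PVE::QemuServer::vm_mon_cmd() wraps QMP commands.
> use strict;
> use warnings;
> use Data::Dumper;
> use PVE::QemuServer;
>
> my $vmid = shift @ARGV or die "usage: $0 <vmid>\n";
>
> # 'query-migrate' reports the status of the most recent migration
> # ('active', 'completed', 'failed', ...) plus transfer statistics.
> my $res = PVE::QemuServer::vm_mon_cmd($vmid, 'query-migrate');
> print Dumper($res);
>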
> I tried to:
> - disable SPICE.
> - set the CPU type to 'default' (kvm64) instead of 'host'.
> - use 'Directory' shared storage (FUSE mount) instead of 'GlusterFS'.
> But no luck, still random failures.
>
> My problem with that comes when the VMs are added to the HA cluster... because Proxmox seems to stop the service when a live migration fails.
> I can't see why anyone would want to stop an HA VM just because a live migration failed, since the VM is still running on the source node.
>
> I have another cluster here (a 2-node RedHat 6 KVM cluster with HA-managed VMs), and when an HA migration fails there, the VMs stay running on the original node.
> I thought it should therefore be possible to achieve the same behaviour with Proxmox?
>
> Having the VMs stopped in an HA cluster is a no-go, so I ended up making some nasty changes in the code.
> I'm still interested in a better solution; so far this seems to do what I need..
>
> +++ /usr/share/cluster/pvevm 2014-07-22 15:22:29.703424516 +0200
> @@ -28,6 +28,7 @@
> use constant OCF_NOT_RUNNING => 7;
> use constant OCF_RUNNING_MASTER => 8;
> use constant OCF_FAILED_MASTER => 9;
> +use constant OCF_ERR_MIGRATE => 150;
>
> $ENV{'PATH'} = '/sbin:/bin:/usr/sbin:/usr/bin';
>
> @@ -358,6 +359,9 @@
>
> upid_wait($upid);
>
> + check_running($status);
> + exit(OCF_ERR_MIGRATE) if $status->{running};
> +
> # something went wrong if old config file is still there
> exit((-f $oldconfig) ? OCF_ERR_GENERIC : OCF_SUCCESS);
>
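> (Since the hunk above carries no surrounding code, here is roughly what the migrate path of /usr/share/cluster/pvevm ends up doing with my change applied. This is a hand-made reconstruction around the hunk, so the exact surrounding lines may differ from the shipped agent; upid_wait(), check_running(), $status and $oldconfig all come from the original script:)
>
> # wait for the PVE migration task identified by $upid to finish
> upid_wait($upid);
>
> # refresh $status->{running} for this VM
> check_running($status);
>
> # added: if the VM is still running here, the live migration failed but
> # the guest itself is untouched, so exit with a distinct code (150)
> # instead of falling through to the generic error below.
> exit(OCF_ERR_MIGRATE) if $status->{running};
>
> # original logic: something went wrong if the old config file is still there
> exit((-f $oldconfig) ? OCF_ERR_GENERIC : OCF_SUCCESS);
>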
> +++ /usr/share/perl5/PVE/API2/Qemu.pm 2014-07-22 15:51:31.909558803 +0200
> @@ -1634,7 +1634,7 @@
>
> my $storecfg = PVE::Storage::config();
>
> - if (&$vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
> + if (&$vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha' && !defined($migratedfrom)) {
>
> my $hacmd = sub {
> my $upid = shift;
>
>
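> (About the second hunk: the condition it touches is the guard in the API handler that hands actions on an HA-managed VM over to the HA stack (the $hacmd wrapper) instead of acting on the VM directly. Adding !defined($migratedfrom) makes the handler skip that redirection when the request is flagged as part of a migration, i.e. when the parameter names the node the VM is migrating from, so only "normal" user requests still go through HA. Roughly, as a commented sketch of the intent:)
>
> # Hand the request over to the HA stack only when:
> #   - the VM is HA-managed,
> #   - the request does not already come from the HA stack itself,
> #   - and it is not an internal call made as part of a migration.
> if (&$vm_is_ha_managed($vmid)
>     && $rpcenv->{type} ne 'ha'
>     && !defined($migratedfrom)) {
>     # ... wrap the operation in $hacmd as before ...
> }
>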
> Regards,
>