[pve-devel] [PATCH] migrate : add nocheck for resume

Alexandre DERUMIER aderumier at odiso.com
Thu Oct 15 07:09:08 CEST 2015


>>I'll try to make the HA "immune" against such errors, but the real bug 
>>isn't in the HA stack :)

Yes, I think we could send a different error code for errors in phase3 of live migration.
Maybe a warning instead of an error for this phase.

in AbstractMigrate.pm


    if ($err) {
        $self->log('err', "migration aborted (duration $duration): $err");
        die "migration aborted\n";
    }

    if ($self->{errors}) {
        $self->log('err', "migration finished with problems (duration $duration)");
        die "migration problems\n"
    }

    $self->log('info', "migration finished successfully (duration $duration)");


maybe add a

    if ($self->{warning}) {
        $self->log('warning', "migration finished with problems (duration $duration)");
        warn "migration problems\n";
        return;
    }
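
To make the idea concrete, here is a minimal self-contained sketch of the three outcomes (abort, problems, warning). It is not the real AbstractMigrate.pm code: the function name finish_migration, the log coderef stored in $self->{log}, and the 'warning'/'ok' return values are assumptions made only for illustration. Only the {errors}/{warning} fields and the log messages come from the snippets above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the end of the migration routine.
# $self->{log} is an assumed coderef taking (level, message).
sub finish_migration {
    my ($self, $err, $duration) = @_;

    if ($err) {
        $self->{log}->('err', "migration aborted (duration $duration): $err");
        die "migration aborted\n";
    }

    if ($self->{errors}) {
        $self->{log}->('err', "migration finished with problems (duration $duration)");
        die "migration problems\n";
    }

    if ($self->{warning}) {
        # Phase3 problem: the VM already runs on the target, so report
        # the problem without making the whole task fail.
        $self->{log}->('warning', "migration finished with problems (duration $duration)");
        warn "migration problems\n";
        return 'warning';    # illustrative return value, not from the thread
    }

    $self->{log}->('info', "migration finished successfully (duration $duration)");
    return 'ok';             # illustrative return value, not from the thread
}
```

With this shape, a caller such as the HA stack would only see a die (and therefore a failed task) for the first two cases, while a phase3 problem would still end the task normally.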


----- Original message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Wednesday, 14 October 2015 22:22:59
Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume

On 14.10.2015 at 20:35, Alexandre DERUMIER wrote:
>>> another problem, 
>>> 
>>> I also hit the bug again, and right after that I can't migrate the VM anymore.
>>> 
>>> The HA migrate task starts, but after that the migration itself never happens.
>>> Oct 14 19:04:33 kvmtest2 pve-ha-lrm[28430]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> I'm not sure, but maybe this is because the migration error occurs in phase3,
> when the VM has already been migrated (in this phase, it should be a warning rather than an error).
> 
> So the HA manager thinks that the migration has failed and that the VM is still on the previous host node
> 
> ? 
Yes, that should be the case. 
When the error propagates through to the exec_resource_agent function,
the migration is seen as failed, and the crm places the service back
in the started state (it assumes the VM hasn't moved, but the config
is no longer there).
The lrm on the old node then tries to start the service because it isn't
running, and the error occurs because the start command gets
executed on the wrong node, so this check fails:
> die "service '$sid' not on this node" if $service_config->{node} ne 
> $nodename; 
I'll try to make the HA "immune" against such errors, but the real bug 
isn't in the HA stack :) 
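
The failing check can be reproduced in isolation. The following is only a sketch under assumptions: check_service_on_node is a hypothetical wrapper around the die line quoted above, and the two hashes are a simplified model of what the cluster config and the crm each believe after a phase3 error. It is not the real PVE2.pm code.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the check quoted above.
sub check_service_on_node {
    my ($sid, $service_config, $nodename) = @_;
    die "service '$sid' not on this node\n"
        if $service_config->{node} ne $nodename;
    return 1;
}

# Assumed state after a phase3 error: the VM config has already moved to
# kvmtest1, but the failed task makes the crm keep the service on kvmtest2.
my $service_config   = { node => 'kvmtest1' };
my $crm_assumed_node = 'kvmtest2';

# The lrm on the old node runs the start command, and the check refuses it.
my $started = eval {
    check_service_on_node('vm:125', $service_config, $crm_assumed_node);
};
print $started ? "start allowed\n" : "start refused: $@";
```

Because the crm retries on every cycle with the same stale assumption, the refusal repeats, which matches the loop seen in the lrm log.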
> 
> ----- Original message -----
> From: "aderumier" <aderumier at odiso.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 14 October 2015 19:07:49
> Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume
> 
> I ran a test with a loop that moves a file every second,
> 
> and monitored the delay between source and target.
> 
> The results are between 10ms and 300ms, with spikes up to 1s,
> 
> so this can explain the race.
> 
> (I can't explain the speed difference and the spikes.)
> 
> 
> Another problem:
> 
> I also hit the bug again, and right after that I can't migrate the VM anymore.
> 
> The HA migrate task starts, but after that the migration itself never happens.
> 
> 
> The pve-ha-crm log floods in a loop:
> 
> Oct 14 19:01:16 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'migrate' to 'started' (node = kvmtest2) 
> Oct 14 19:01:16 kvmtest1 pve-ha-crm[3819]: migrate service 'vm:125' to node 'kvmtest1' (running) 
> Oct 14 19:01:16 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'started' to 'migrate' (node = kvmtest2, target = kvmtest1) 
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: service 'vm:125' - migration failed (exit code 255) 
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'migrate' to 'started' (node = kvmtest2) 
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: migrate service 'vm:125' to node 'kvmtest1' (running) 
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'started' to 'migrate' (node = kvmtest2, target = kvmtest1) 
> 
> 
> Oct 14 19:04:33 kvmtest2 pve-ha-lrm[28430]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:04:43 kvmtest2 pve-ha-lrm[28451]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:04:53 kvmtest2 pve-ha-lrm[28472]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:05:03 kvmtest2 pve-ha-lrm[28493]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:05:13 kvmtest2 pve-ha-lrm[28520]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:05:23 kvmtest2 pve-ha-lrm[28541]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:05:33 kvmtest2 pve-ha-lrm[28562]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:05:43 kvmtest2 pve-ha-lrm[28583]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:05:53 kvmtest2 pve-ha-lrm[28604]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> Oct 14 19:06:03 kvmtest2 pve-ha-lrm[28626]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389. 
> 
> 
> ----- Original message -----
> From: "aderumier" <aderumier at odiso.com>
> To: "dietmar" <dietmar at proxmox.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 14 October 2015 16:17:24
> Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume
> 
>>> To be sure, I would also test with my direct_io patch for fuse... 
> yes, I'm currently using it. 
> 
> I have written a simple Perl script that monitors creation/deletion of the VM conf file,
> and the times are indeed correct compared with notify
> 
> 
> node1 
> ----- 
> 
> exist 20151014 16:14:06.183
> notexist 20151014 16:14:38.989
> exist 20151014 16:15:07.066
> 
> node2
> -----
> notexist 20151014 16:14:06.208
> exist 20151014 16:14:39.003
> notexist 20151014 16:15:07.089
> 
> 
> I'll try to reproduce the problem and compare time again 
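
The script itself was not posted. A monitor of this kind could look roughly like the sketch below, which polls for the file and records a timestamped line on every exist/notexist transition. The names stamp and watch_file, the polling interval, and the number of rounds are assumptions for illustration. The path is taken from the command line rather than guessed.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);
use POSIX qw(strftime);

# Format an epoch time with milliseconds, in the style "YYYYMMDD HH:MM:SS.mmm".
sub stamp {
    my ($t) = @_;
    my $ms = int(($t - int($t)) * 1000);
    return strftime('%Y%m%d %H:%M:%S', localtime(int($t))) . sprintf('.%03d', $ms);
}

# Poll $path $rounds times, sleeping $interval seconds between polls, and
# return one [state, timestamp] pair per observed transition (the first
# poll always counts as a transition).
sub watch_file {
    my ($path, $rounds, $interval) = @_;
    my @events;
    my $last;
    for (1 .. $rounds) {
        my $state = (-e $path) ? 'exist' : 'notexist';
        if (!defined $last || $state ne $last) {
            push @events, [$state, stamp(time())];
            $last = $state;
        }
        sleep($interval);
    }
    return @events;
}

# Usage: perl watch.pl <path-to-vm-conf>   (about 60s of 10ms polls)
if (@ARGV) {
    print join(' ', @$_), "\n" for watch_file($ARGV[0], 6000, 0.01);
}
```

Running it on both nodes at once against the same conf file gives two lists like the ones above, provided the node clocks are synchronized closely enough to compare.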
> 
> ----- Original message -----
> From: "dietmar" <dietmar at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 14 October 2015 16:00:28
> Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume
> 
>> http://search.cpan.org/~andya/File-Monitor-1.00/lib/File/Monitor.pm 
>> 
>> which used stat() to detect changes 
> _______________________________________________ 
> pve-devel mailing list 
> pve-devel at pve.proxmox.com 
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel at pve.proxmox.com 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 