[pve-devel] [PATCH] migrate : add nocheck for resume
Thomas Lamprecht
t.lamprecht at proxmox.com
Wed Oct 14 22:22:59 CEST 2015
On 14.10.2015 at 20:35, Alexandre DERUMIER wrote:
>>> Another problem:
>>>
>>> I hit the bug again, and right after that I can no longer migrate the VM.
>>>
>>> The HA migrate task starts, but after that the migration itself never happens.
>>> Oct 14 19:04:33 kvmtest2 pve-ha-lrm[28430]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> I'm not sure, but maybe this is because the migration error occurs in phase3,
> when the VM has already been migrated. (In this phase it should be a warning rather than an error.)
>
> So the HA manager thinks that the migration has failed and that the VM is still on the previous host node?
Yes, that should be the case.
When the error propagates up to the exec_resource_agent function, the
migration is treated as failed. The CRM then puts the service back into
the 'started' state on the old node, because it assumes the VM has not
moved, although the config is no longer there.
The LRM on the old node then tries to start the service, since it is not
running there. The start command is therefore executed on the wrong
node, and this check fails:
> die "service '$sid' not on this node" if $service_config->{node} ne
> $nodename;
I'll try to make the HA "immune" against such errors, but the real bug
isn't in the HA stack :)
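
To make the sequence above concrete, here is a small Python model of it. This is only a sketch and not the HA stack's code: the real check is the Perl line quoted above, and `ServiceState`, `crm_handle_migration_result` and `lrm_start` are hypothetical names invented for the illustration. It assumes the CRM only updates the service's node when the migration exits with code 0.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    sid: str
    state: str  # 'started' or 'migrate'
    node: str   # node the CRM believes the service is on


def crm_handle_migration_result(svc: ServiceState, target: str, exit_code: int) -> ServiceState:
    """Hypothetical CRM reaction: only a zero exit code moves the service."""
    node = target if exit_code == 0 else svc.node
    return ServiceState(svc.sid, "started", node)


def lrm_start(sid: str, config_node: str, nodename: str) -> None:
    """Python rendering of the quoted check: refuse to start a service
    whose config lives on another node."""
    if config_node != nodename:
        raise RuntimeError(f"service '{sid}' not on this node")


# Scenario from the logs: the migration fails late with exit code 255,
# but the VM config has already moved to the target node.
svc = ServiceState("vm:125", "migrate", "kvmtest2")
svc = crm_handle_migration_result(svc, target="kvmtest1", exit_code=255)
config_node = "kvmtest1"  # assumption: config was moved before the error

try:
    # The LRM on the node the CRM still believes in tries to start the service.
    lrm_start(svc.sid, config_node, svc.node)
    outcome = "started"
except RuntimeError as err:
    outcome = str(err)

print(outcome)
```

In this model the CRM keeps pointing at kvmtest2, so every retry on that node hits the same check, which matches the repeating log lines quoted below.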
>
> ----- Original message -----
> From: "aderumier" <aderumier at odiso.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 14 October 2015 19:07:49
> Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume
>
> I ran a test with a loop that moves a file every second,
>
> and measured the delay between source and target.
>
> The results are between 10ms and 300ms, with spikes up to 1s,
>
> so this can explain the race.
>
> (I can't explain the speed differences and the spikes.)
>
>
> Another problem:
>
> I hit the bug again, and right after that I can no longer migrate the VM.
>
> The HA migrate task starts, but after that the migration itself never happens.
>
>
> The pve-ha-crm log is flooded with this loop:
>
> Oct 14 19:01:16 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'migrate' to 'started' (node = kvmtest2)
> Oct 14 19:01:16 kvmtest1 pve-ha-crm[3819]: migrate service 'vm:125' to node 'kvmtest1' (running)
> Oct 14 19:01:16 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'started' to 'migrate' (node = kvmtest2, target = kvmtest1)
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: service 'vm:125' - migration failed (exit code 255)
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'migrate' to 'started' (node = kvmtest2)
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: migrate service 'vm:125' to node 'kvmtest1' (running)
> Oct 14 19:01:26 kvmtest1 pve-ha-crm[3819]: service 'vm:125': state changed from 'started' to 'migrate' (node = kvmtest2, target = kvmtest1)
>
>
> Oct 14 19:04:33 kvmtest2 pve-ha-lrm[28430]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:04:43 kvmtest2 pve-ha-lrm[28451]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:04:53 kvmtest2 pve-ha-lrm[28472]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:05:03 kvmtest2 pve-ha-lrm[28493]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:05:13 kvmtest2 pve-ha-lrm[28520]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:05:23 kvmtest2 pve-ha-lrm[28541]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:05:33 kvmtest2 pve-ha-lrm[28562]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:05:43 kvmtest2 pve-ha-lrm[28583]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:05:53 kvmtest2 pve-ha-lrm[28604]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
> Oct 14 19:06:03 kvmtest2 pve-ha-lrm[28626]: service 'vm:125' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
>
>
> ----- Original message -----
> From: "aderumier" <aderumier at odiso.com>
> To: "dietmar" <dietmar at proxmox.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 14 October 2015 16:17:24
> Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume
>
>>> To be sure, I would also test with my direct_io patch for fuse...
> Yes, I'm currently using it.
>
> I wrote a simple Perl script that monitors creation/deletion of the VM conf file,
> and the times are indeed correct compared to notify.
>
>
> node1
> -----
>
> exist    20151014 16:14:06.183
> notexist 20151014 16:14:38.989
> exist    20151014 16:15:07.066
>
> node2
> -----
> notexist 20151014 16:14:06.208
> exist    20151014 16:14:39.003
> notexist 20151014 16:15:07.089
>
>
> I'll try to reproduce the problem and compare the times again.
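
As a worked example, the node1 and node2 timestamps listed above can be compared pairwise. The Python sketch below assumes that the clocks of both nodes are synchronized and that the n-th event on node1 corresponds to the n-th event on node2. The thread states neither, so treat the result as indicative only.

```python
from datetime import datetime

FMT = "%Y%m%d %H:%M:%S.%f"

# Timestamps copied from the node1 and node2 listings in this message.
node1 = ["20151014 16:14:06.183", "20151014 16:14:38.989", "20151014 16:15:07.066"]
node2 = ["20151014 16:14:06.208", "20151014 16:14:39.003", "20151014 16:15:07.089"]


def delays_ms(first, second):
    """Return the delay in milliseconds between paired events."""
    out = []
    for a, b in zip(first, second):
        delta = datetime.strptime(b, FMT) - datetime.strptime(a, FMT)
        out.append(round(delta.total_seconds() * 1000))
    return out


print(delays_ms(node1, node2))  # [25, 14, 23]
```

Under those assumptions the three events reach node2 between 14ms and 25ms after node1 records them.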
>
> ----- Original message -----
> From: "dietmar" <dietmar at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 14 October 2015 16:00:28
> Subject: Re: [pve-devel] [PATCH] migrate : add nocheck for resume
>
>> http://search.cpan.org/~andya/File-Monitor-1.00/lib/File/Monitor.pm
>>
>> which used stat() to detect changes
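
For illustration, stat()-based change detection of the kind mentioned above can be sketched in a few lines of Python. This is not the File::Monitor implementation. The helper names `exists` and `detect_change` and the file name `125.conf` are made up for the example, and a real monitor would call `detect_change` in a timed loop.

```python
import os
import tempfile


def exists(path: str) -> bool:
    """Use stat() to decide whether the file is currently present."""
    try:
        os.stat(path)
        return True
    except FileNotFoundError:
        return False


def detect_change(path: str, was_present: bool):
    """Compare the current stat() result with the previous poll.

    Returns (event, now_present), where event is 'exist', 'notexist',
    or None if nothing changed since the last poll.
    """
    now_present = exists(path)
    if now_present == was_present:
        return None, now_present
    return ("exist" if now_present else "notexist"), now_present


# Small demo: create and delete a file, polling after each step.
with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "125.conf")  # hypothetical file name
    events = []
    present = exists(conf)

    open(conf, "w").close()
    event, present = detect_change(conf, present)
    events.append(event)

    os.remove(conf)
    event, present = detect_change(conf, present)
    events.append(event)

print(events)  # ['exist', 'notexist']
```

Polling like this only sees a change at the next poll, so the measured time is at best accurate to the polling interval.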
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel