[pve-devel] [PATCH ha-manager 1/4] fix possible out of sync state on migrations
Thomas Lamprecht
t.lamprecht at proxmox.com
Tue Feb 16 16:46:42 CET 2016
Description of the problem. Imagine the following:

We get the CRM command to migrate 'vm:100' from A to B. The migration
fails. Normally the CRM would then place the service in the started
state on the source node A once it processes our result.

But if the CRM has not processed our result before we start a new
'manage_resources' round (we do that about every 5 seconds), the LRM
may start another migration attempt without the CRM knowing anything
about it. Worse, the CRM may process the result of the failed
migration attempt at the same time and place the service in the
started state on node A, while the LRM now successfully migrates the
service to B with the second (hidden) attempt.

Now the state is out of sync: the CRM has the service marked as
started on node A, but it runs on node B. We (currently) have no way
to fix up a wrong node location of a _running_ service, so the LRM
on node A fails with EWRONG_NODE and the CRM places the service in
the error state.

To fix that, we _never_ execute two identical migrate commands
directly after each other. Identical means that the sid and the
target are the same.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
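Note: the scenario from the commit message can be reproduced with a
small standalone model. This is only a sketch, not the LRM code:
queue_command() and simulate() are hypothetical names, the worker
entry is reduced to the three fields the check reads (pid, state,
target), and it assumes that the entry of the previous attempt is
still present without a pid when the next round runs.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for the LRM worker queue.
# $guard switches the duplicate-migration check on or off.
sub queue_command {
    my ($workers, $sid, $state, $target, $guard) = @_;

    if (my $w = $workers->{$sid}) {
        return 'running' if $w->{pid};    # a worker is already active

        # An identical migrate request means the CRM has not processed
        # the result of the previous attempt yet, so do not retry.
        if ($guard && $state eq 'migrate' && $w->{state} eq $state
            && defined($w->{target}) && $w->{target} eq $target) {
            return 'ignored';
        }

        delete $workers->{$sid};          # different command: replace entry
    }

    $workers->{$sid} = { state => $state, target => $target, pid => undef };
    return 'queued';
}

# Two consecutive rounds that both see the same (stale) CRM request.
sub simulate {
    my ($guard) = @_;
    my $workers = {};
    my @results;

    # Round 1: the CRM asks to migrate vm:100 to B.
    push @results, queue_command($workers, 'vm:100', 'migrate', 'B', $guard);
    # Round 2: the CRM status is unchanged, the same request shows up again.
    push @results, queue_command($workers, 'vm:100', 'migrate', 'B', $guard);

    return join(',', @results);
}

print "without guard: ", simulate(0), "\n";
print "with guard:    ", simulate(1), "\n";
```

Without the check the second round queues a hidden retry
("queued,queued"); with it the retry is dropped ("queued,ignored")
until the CRM sends a different command for the service.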
src/PVE/HA/LRM.pm | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index f53f26d..060ae9d 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -457,6 +457,14 @@ sub queue_resource_command {
     if (my $w = $self->{workers}->{$sid}) {
         return if $w->{pid}; # already started
+        if ($state eq 'migrate' && $w->{state} eq $state && $w->{target} eq $target) {
+            # ignore two identical migration tries directly after each other,
+            # as this means that the CRM didn't get our result yet and a
+            # second migration try is dangerous (EWRONG_NODE)!
+            $self->{haenv}->log('notice', "Ignore second identical migration call," .
+                                " CRM didn't processed our last result yet.");
+            return;
+        }
         # else, delete and overwrite queue entry with new command
         delete $self->{workers}->{$sid};
     }
--
2.1.4