[pve-devel] [PATCH ha-manager v4 1/3] avoid out of sync command execution in LRM
Thomas Lamprecht
t.lamprecht at proxmox.com
Tue Feb 23 16:15:25 CET 2016
We are only allowed to execute any command once, as otherwise we may
disturb the synchronization between the CRM and the LRM.
The following scenario could happen:

schedule CRM: deploy task 'migrate' for service vm:100 with UID 1234
schedule LRM: fork task with UID 1234
schedule CRM: idle, as no result is available yet
schedule LRM: collect finished task UID 1234 and write result for CRM;
              the result was an error
schedule LRM: fork task with UID 1234 _AGAIN_
schedule CRM: process the _previous_ erroneous result from the LRM and
              place vm:100 in the 'started' state on the source node
schedule LRM: collect finished task UID 1234 and write result for CRM;
              this time the task was successful and the service is on
              the target node, but the CRM knows _nothing_ of it!
schedule LRM: try to schedule the task, but the service is not on this
              node anymore!
=> error
To fix this, we _never_ execute two exactly identical commands directly
after each other; 'identical' means here that the UID of the command to
queue matches the UID of the last _finished_ command.
This enforces the originally intended SYN-ACK behaviour between the CRM
and the LRMs.
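
Condensed, the guard added to queue_resource_command() below boils down
to the following check (a sketch only; should_queue_command is a
hypothetical helper used here for illustration, the hash names follow
the patch):

    # skip a command whose UID matches the UID of the last _finished_
    # command for this service
    sub should_queue_command {
        my ($processed_uids, $sid, $uid) = @_;

        return 0 if $uid
            && defined($processed_uids->{$sid})
            && $uid eq $processed_uids->{$sid};

        return 1;
    }

So a second 'migrate' queued with an already processed UID is simply
dropped until the CRM hands out a fresh UID.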
We now generate a new UID for services which do not change state when
one of the following evaluates to true:
* disabled AND in the 'stopped' state
* enabled AND in the 'started' state
This ensures that the state requested by the CRM keeps being enforced by
the LRM and thus, for example, a killed VM gets restarted (see the
sketch below).
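
Roughly sketched (needs_fresh_uid is a hypothetical helper for
illustration only; compute_new_uuid and the $cd/$sd hashes are the ones
already used in Manager.pm):

    # CRM side: decide whether to hand out a fresh UID although the
    # requested state did not change, so the LRM keeps accepting (and
    # re-executing) commands for this service
    sub needs_fresh_uid {
        my ($cd, $sd) = @_;    # configured state vs. current service state

        return ($cd->{state} eq 'disabled' && $sd->{state} eq 'stopped')
            || ($cd->{state} eq 'enabled'  && $sd->{state} eq 'started');
    }

    # in the state machine:
    # $sd->{uid} = compute_new_uuid($sd->{state}) if needs_fresh_uid($cd, $sd);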
Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
changes since v3:
* do not restart any task with the same UID, not only the
  migrate/relocate ones
* with this, the behaviour of the error state also changes, so we only
  log once when a service gets placed in the error state
* remove the logging as it only spams the log without any use
* the new UID computation, described in the commit message
* added a return after the change to the error state in
  'next_state_started' to avoid possible false state changes out of the
  error state when select_service_node returns another node
src/PVE/HA/LRM.pm | 13 +++++++++++++
src/PVE/HA/Manager.pm | 17 +++++++++++------
src/test/test-resource-failure5/log.expect | 2 --
3 files changed, 24 insertions(+), 8 deletions(-)
diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index f53f26d..f0e7977 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -31,6 +31,8 @@ sub new {
workers => {},
results => {},
restart_tries => {},
+ # store finished job UIDs so we don't execute the same command twice
+ processed_uids => {},
shutdown_request => 0,
shutdown_errors => 0,
# mode can be: active, reboot, shutdown, restart
@@ -413,6 +415,7 @@ sub run_workers {
$haenv->log('err', $err);
}
if (defined($w->{uid})) {
+ $self->{processed_uids}->{$sid} = $w->{uid};
$self->resource_command_finished($sid, $w->{uid}, $res);
} else {
$self->stop_command_finished($sid, $res);
@@ -455,6 +458,15 @@ sub manage_resources {
sub queue_resource_command {
my ($self, $sid, $uid, $state, $target) = @_;
+ my $haenv = $self->{haenv};
+ my $processed = $self->{processed_uids};
+
+ # do not queue the exact same command twice as this may lead to
+ # an inconsistent HA state when the first command fails but the CRM
+ # does not process its failure right away and the LRM starts a second
+ # try, without the CRM knowing of it (race condition)
+ return if ($uid && $processed->{$sid} && $uid eq $processed->{$sid});
+
if (my $w = $self->{workers}->{$sid}) {
return if $w->{pid}; # already started
# else, delete and overwrite queue entry with new command
@@ -482,6 +494,7 @@ sub check_active_workers {
my $waitpid = waitpid($pid, WNOHANG);
if (defined($waitpid) && ($waitpid == $pid)) {
if (defined($w->{uid})) {
+ $self->{processed_uids}->{$sid} = $w->{uid};
$self->resource_command_finished($sid, $w->{uid}, $?);
} else {
$self->stop_command_finished($sid, $?);
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index d0031e7..2e5194f 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -517,12 +517,14 @@ sub next_state_stopped {
} else {
$haenv->log('err', "unknown command '$cmd' for service '$sid'");
}
- }
+ }
if ($cd->{state} eq 'disabled') {
- # do nothing
+ # tell LRM we are in a synced state
+ # ensure that it stays stopped
+ $sd->{uid} = compute_new_uuid($sd->{state});
return;
- }
+ }
if ($cd->{state} eq 'enabled') {
# simply mark it started, if it's on the wrong node
@@ -582,7 +584,8 @@ sub next_state_started {
} elsif ($ec == ETRY_AGAIN) {
- # do nothing, the LRM wants to try again
+ # tell the LRM we got the result and that it can try again
+ $sd->{uid} = compute_new_uuid($sd->{state});
} elsif ($ec == ERROR) {
# apply our relocate policy if we got ERROR from the LRM
@@ -612,6 +615,7 @@ sub next_state_started {
" (exit code $ec))");
# we have no save way out (yet) for other errors
&$change_service_state($self, $sid, 'error');
+ return;
}
}
@@ -627,12 +631,13 @@ sub next_state_started {
&$change_service_state($self, $sid, 'relocate', node => $sd->{node}, target => $node);
}
} else {
- # do nothing
+ # ensure the service gets started again if it went down unexpectedly
+ $sd->{uid} = compute_new_uuid($sd->{state});
}
}
return;
- }
+ }
$haenv->log('err', "service '$sid' - unknown state '$cd->{state}' in service configuration");
}
diff --git a/src/test/test-resource-failure5/log.expect b/src/test/test-resource-failure5/log.expect
index b6e7807..283ca8c 100644
--- a/src/test/test-resource-failure5/log.expect
+++ b/src/test/test-resource-failure5/log.expect
@@ -31,8 +31,6 @@ err 143 node2/lrm: unable to start service fa:130 on local node after 1 r
err 160 node1/crm: recovery policy for service fa:130 failed, entering error state!
info 160 node1/crm: service 'fa:130': state changed from 'started' to 'error'
warn 163 node2/lrm: service fa:130 is not running and in an error state
-warn 183 node2/lrm: service fa:130 is not running and in an error state
-warn 203 node2/lrm: service fa:130 is not running and in an error state
info 220 cmdlist: execute service fa:130 disabled
info 220 node1/crm: service 'fa:130': state changed from 'error' to 'stopped'
info 820 hardware: exit simulation - done
--
2.1.4