[pve-devel] [PATCH ha-manager v3 3/6] fix possible out of sync state on migrations

Fri Feb 19 18:41:04 CET 2016

Description of the problem. Imagine the following:
We get the CRM command to migrate 'vm:100' from A to B.
If the migration fails, the CRM would normally place the service in
the started state on the source node A once it processes our result.
But if the CRM has not processed our result before we start a new
'manage_resources' round in the LRM, the LRM may retry the migration
without the CRM knowing anything about it. Worse, the CRM may process
the result of the failed migration try at the same time and place the
service in the started state on node A, while the LRM has now
successfully migrated the service to B with the second (hidden) try.
Now the state is out of sync:

The CRM then has the service marked as started on node A, but it runs
on node B. We (currently) have no way to fix up a wrong node location
of a _running_ service, thus the LRM on node A fails with
EWRONG_NODE and the CRM places the service in the error state.

To fix that, we _never_ execute the exact same migrate command twice
in a row. "Exact same" means here that the UID of the command to
queue matches the UID of the last _finished_ command, and that the
command is either 'migrate' or 'relocate'.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
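For reviewers, a minimal standalone sketch of the guard described in the
commit message. It is not the LRM code itself: the plain hash standing in
for the LRM object and the helper names queue_command and command_finished
are hypothetical, only the processed_uids field mirrors the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the LRM object; only the fields needed here.
my $lrm = { processed_uids => {}, queued => [] };

# Returns 1 if the command was queued, 0 if it was skipped as a repeat.
sub queue_command {
    my ($self, $sid, $uid, $state) = @_;

    my $last = $self->{processed_uids}->{$sid};
    if (($state eq 'migrate' || $state eq 'relocate') &&
        defined($last) && $uid eq $last) {
        return 0; # same command already finished, wait for the CRM
    }

    push @{$self->{queued}}, "$sid:$uid:$state";
    return 1;
}

# Called when a worker for $sid exits; remembers the finished command's UID.
sub command_finished {
    my ($self, $sid, $uid) = @_;
    $self->{processed_uids}->{$sid} = $uid;
}

my @log;
push @log, queue_command($lrm, 'vm:100', 'uid-1', 'migrate'); # first try
command_finished($lrm, 'vm:100', 'uid-1');                    # it fails and finishes
push @log, queue_command($lrm, 'vm:100', 'uid-1', 'migrate'); # CRM has not reacted yet
push @log, queue_command($lrm, 'vm:100', 'uid-2', 'migrate'); # CRM issued a new command
print join(',', @log), "\n";
```

The repeated 'uid-1' migration is skipped, while a command with a fresh
UID from the CRM is queued as usual.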
 src/PVE/HA/LRM.pm | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index f53f26d..7b7dcc5 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -31,6 +31,8 @@ sub new {
 	workers => {},
 	results => {},
 	restart_tries => {},
+	# store the UID of each service's last finished job so we do not exec error prone ones twice
+	processed_uids => {},
 	shutdown_request => 0,
 	shutdown_errors => 0,
 	# mode can be: active, reboot, shutdown, restart
@@ -359,6 +361,7 @@ sub do_one_iteration {
     return 1;
 }
 
+
 sub run_workers {
     my ($self) = @_;
 
@@ -455,6 +458,22 @@ sub manage_resources {
 sub queue_resource_command {
     my ($self, $sid, $uid, $state, $target) = @_;
 
+    if (($state eq 'migrate' || $state eq 'relocate') &&
+	$self->{processed_uids}->{$sid} &&
+	$uid eq $self->{processed_uids}->{$sid}) {
+
+	# do not queue the same migration/relocation command twice as this may
+	# lead to an inconsistent HA state when the first command fails but the
+	# CRM does not process it right away and the LRM starts a second try,
+	# which succeeds while the CRM processes the failed one and places the
+	# SID as started here although it is already on the target node
+	# (resulting in EWRONG_NODE)
+	$self->{haenv}->log('notice', "Service '$sid': ignore retrying '$state'".
+			    " command, UID '$uid'. CRM did not get its result yet.");
+	return;
+
+    }
+
     if (my $w = $self->{workers}->{$sid}) {
 	return if $w->{pid}; # already started
 	# else, delete and overwrite queue entry with new command
@@ -482,6 +501,7 @@ sub check_active_workers {
 	    my $waitpid = waitpid($pid, WNOHANG);
 	    if (defined($waitpid) && ($waitpid == $pid)) {
 		if (defined($w->{uid})) {
+		    $self->{processed_uids}->{$sid} = $w->{uid};
 		    $self->resource_command_finished($sid, $w->{uid}, $?);
 		} else {
 		    $self->stop_command_finished($sid, $?);
-- 
2.1.4