[pve-devel] [PATCH ha-manager 3/4] fix race condition on slow resource commands

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Sep 9 16:15:36 CEST 2016


When we fixed the dangling state machine - where one command request
from the CRM could result in multiple executions of said command by
the LRM - by ensuring in the LRM that UID-identified actions get
started only once per UID (except for the stop command), we
introduced a bug which can result in the loss of the LRM result of a
failed service start.

The reason for this is that we generated a new UID for the started
state on every CRM turn, so that a service gets restarted if it
crashes. But as we do this without checking whether the LRM has
finished our last request, we may lose the result of that request.
As an example, consider the following timeline of events:
1. CRM requests start of service 'ct:100'
2. LRM starts processing this request, which takes a bit longer
3. Before the LRM worker finishes, the CRM does an iteration and
   generates a new UID for this service
4. The LRM worker finishes but cannot write back its result, as
   the UID doesn't exist anymore in the manager's service status.
5. The CRM gets another round and generates a new UID for 'ct:100'
6. The cycle begins again: the LRM always throws away its last
   result, as the CRM wrongly generated a new command UID
   before allowing the LRM to write back its result.

This loss of the result is problematic if it was an erroneous one,
because then it results in a malfunction of the failure restart and
relocate policies.

Fix this by checking in the CRM whether the last command was
processed by the LRM. This only needs to be done in the
next_state_started state, as this is the only one which may generate
a new UID without processing the LRM result at the same time.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
 src/PVE/HA/Manager.pm | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
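
As a note for reviewers, the gist of the change as a minimal standalone
sketch (hypothetical helper and variable names, compute_new_uuid stubbed
out for illustration; the real logic lives in next_state_started below):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-in for the real compute_new_uuid() from Manager.pm.
    sub compute_new_uuid { my ($state) = @_; return "$state-" . time() . "-$$"; }

    # processed_lrm_uids records command UIDs whose LRM result was
    # already consumed by the CRM.
    my $self = { processed_lrm_uids => {} };
    my $sd   = { state => 'started', uid => 'uid-1' };

    # Only hand out a new command UID once the LRM result for the
    # previous UID has been processed, so no result can get thrown away.
    sub maybe_regenerate_uid {
        my ($self, $sd) = @_;
        my $old_uid = $sd->{uid};
        return unless $self->{processed_lrm_uids}->{$old_uid};
        $sd->{uid} = compute_new_uuid($sd->{state});
        delete $self->{processed_lrm_uids}->{$old_uid};
    }

    maybe_regenerate_uid($self, $sd);          # LRM still busy -> UID kept
    $self->{processed_lrm_uids}{'uid-1'} = 1;  # LRM result consumed
    maybe_regenerate_uid($self, $sd);          # now a new UID is generated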

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 2fa8a2d..0198154 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -25,6 +25,8 @@ sub new {
     # fixme: use separate class  PVE::HA::ServiceStatus
     $self->{ss} = $old_ms->{service_status} || {};
 
+    $self->{processed_lrm_uids} = $old_ms->{processed_lrm_uids} || {};
+
     $self->{ms} = { master_node => $haenv->nodename() };
 
     return $self;
@@ -43,6 +45,7 @@ sub flush_master_status {
 
     $ms->{node_status} = $ns->{status};
     $ms->{service_status} = $ss;
+    $ms->{processed_lrm_uids} = $self->{processed_lrm_uids};
     $ms->{timestamp} = $haenv->get_time();
     
     $haenv->write_manager_status($ms);
@@ -619,6 +622,8 @@ sub next_state_started {
 
 	    if ($lrm_res) {
 
+		$self->{processed_lrm_uids}->{$sd->{uid}} = 1;
+
 		my $ec = $lrm_res->{exit_code};
 		if ($ec == SUCCESS) {
 
@@ -679,7 +684,12 @@ sub next_state_started {
 		                "Tried nodes: " . join(', ', @{$sd->{failed_nodes}}));
 		}
 		# ensure service get started again if it went unexpected down
-		$sd->{uid} = compute_new_uuid($sd->{state});
+		# but also ensure that no LRM result gets lost
+		my $old_uid = $sd->{uid};
+		if ($self->{processed_lrm_uids}->{$old_uid}) {
+		    $sd->{uid} = compute_new_uuid($sd->{state});
+		    delete $self->{processed_lrm_uids}->{$old_uid};
+		}
 	    }
 	}
 
-- 
2.1.4