[pve-devel] [PATCH ha-manager v7 1/4] allow LRM lock stealing for fenced nodes

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Aug 5 14:43:17 CEST 2016


We are only allowed to recover (=steal) a service when we hold its
LRM's lock, as this guarantees that even if said LRM comes up again
during the steal operation, it cannot start the service while the
service config still belongs to it for a short time.

The possible situations are:
The node was fenced and is rendered unable to do anything; here we
may just take its services and recover them.

The node is still in an unknown state and may be online, so either
a) it has quorum and so sees that it no longer holds its lock,
   and thus it will not do anything; this is the situation we
   cover here.
b) it has no quorum and thus will never do anything, and if it comes
   online at any time its lock is gone and we are on the safe side.
This is important, as otherwise we have a possible race for the
resource, which can result in a service being started both on the old
(restarted) node and on the node the service was recovered to, which
is really bad!

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
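For illustration only, the release rule that the simulator hunk below
implements can be sketched as a stand-alone function. This is a minimal
sketch and not part of the patch: the helper name sim_release_lock, its
argument list, and the return value for a lock that is not held at all
are assumptions; only the 'ha_manager_lock' key, the ha_agent_${node}_lock
naming, and the owner-or-manager check are taken from the diff.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the branch structure of the patched
# unlock path in sim_get_lock(); name and signature are assumptions.
sub sim_release_lock {
    my ($data, $lock_name, $nodename, $ctime, $lock_timeout) = @_;

    my $d = $data->{$lock_name};
    return 0 if !$d; # assumption: nothing held, nothing to release

    my $manager_node = ($data->{ha_manager_lock} || {})->{node} || '';

    # lock already timed out: report success, entry is left untouched
    return 1 if ($ctime - $d->{time}) > $lock_timeout;

    # lock still valid: only its owner or the current manager may drop it
    if ($d->{node} eq $nodename || $manager_node eq $nodename) {
        delete $data->{$lock_name};
        return 1;
    }

    return 0;
}

my $data = {
    ha_manager_lock     => { node => 'node1', time => 100 },
    ha_agent_node3_lock => { node => 'node3', time => 100 },
};

# node2 is neither the owner nor the manager, so it is refused
print sim_release_lock($data, 'ha_agent_node3_lock', 'node2', 110, 120), "\n"; # 0
# node1 holds the manager lock and may steal the still-valid LRM lock
print sim_release_lock($data, 'ha_agent_node3_lock', 'node1', 110, 120), "\n"; # 1
```

The point of the design is that a plain LRM can never drop another
node's valid lock; only the node holding the manager lock can.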
 src/PVE/HA/Env.pm      |  4 ++--
 src/PVE/HA/Env/PVE2.pm |  4 ++--
 src/PVE/HA/Sim/Env.pm  | 16 ++++++++++------
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
index 55f6684..5e98bda 100644
--- a/src/PVE/HA/Env.pm
+++ b/src/PVE/HA/Env.pm
@@ -165,9 +165,9 @@ sub get_ha_agent_lock {
 # this should only get called if the nodes LRM gracefully shuts down with
 # all services already cleanly stopped!
 sub release_ha_agent_lock {
-    my ($self) = @_;
+    my ($self, $node) = @_;
 
-    return $self->{plug}->release_ha_agent_lock();
+    return $self->{plug}->release_ha_agent_lock($node);
 }
 
 # return true when cluster is quorate
diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index ef6485d..c8d1bee 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -327,9 +327,9 @@ sub get_ha_agent_lock {
 # this should only get called if the nodes LRM gracefully shuts down with
 # all services already cleanly stopped!
 sub release_ha_agent_lock {
-    my ($self) = @_;
+    my ($self, $node) = @_;
 
-    my $node = $self->nodename();
+    $node = $node || $self->nodename();
 
     return rmdir("$lockdir/ha_agent_${node}_lock");
 }
diff --git a/src/PVE/HA/Sim/Env.pm b/src/PVE/HA/Sim/Env.pm
index cd1574c..a99687a 100644
--- a/src/PVE/HA/Sim/Env.pm
+++ b/src/PVE/HA/Sim/Env.pm
@@ -75,13 +75,17 @@ sub sim_get_lock {
 	    if (my $d = $data->{$lock_name}) {
 		my $tdiff = $ctime - $d->{time};
 
+		my $manager_node = $data->{'ha_manager_lock'}->{node} || '';
+
+		$res = 0;
 		if ($tdiff > $self->{lock_timeout}) {
 		    $res = 1;
-		} elsif (($tdiff <= $self->{lock_timeout}) && ($d->{node} eq $nodename)) {
-		    delete $data->{$lock_name};
-		    $res = 1;
 		} else {
-		    $res = 0;
+		    # if we aren't manager we may unlock only *our* lock
+		    if ($d->{node} eq $nodename || $manager_node eq $nodename) {
+			delete $data->{$lock_name};
+			$res = 1;
+		    }
 		}
 	    }
 
@@ -284,9 +288,9 @@ sub get_ha_agent_lock {
 # this should only get called if the nodes LRM gracefully shuts down with
 # all services already cleanly stopped!
 sub release_ha_agent_lock {
-    my ($self) = @_;
+    my ($self, $node) = @_;
 
-    my $node = $self->nodename();
+    $node = $node || $self->nodename();
 
     my $lock = $self->get_ha_agent_lock_name($node);
     return $self->sim_get_lock($lock, 1);
-- 
2.1.4




