[pve-devel] [RFC ha-manager] manage: handle edge case where a node gets stuck in 'fence' state

Fabian Ebner f.ebner at proxmox.com
Fri Oct 8 14:52:26 CEST 2021


If all services in 'fence' state disappear from a node (e.g. because
the services were removed) before fence_node() succeeds, the node
gets stuck in the 'fence' state. Avoid this by calling fence_node()
whenever a node is in 'fence' state, regardless of service state.

Reported in the community forum:
https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/

Signed-off-by: Fabian Ebner <f.ebner at proxmox.com>
---

Not really sure if this is worth it, because it's a hard-to-reach edge
case, but AFAICT there is no good way to get out of being stuck. What
would work is either of:
    * Manually correcting the node state.
    * Adding a service to the stuck node and triggering a fence
      situation.

An alternative would be to keep services in 'fence' state in the
manager state, even if they were removed from the config. But the
approach from this patch seemed a bit more robust: for example, it
will fix an already existing stuck state, rather than just avoid
creating one.
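
To illustrate the stuck state, here is a minimal, self-contained toy
model. It is only a sketch: ToyNodeStatus and manage_round() are
hypothetical stand-ins, not the real PVE::HA code, and the toy
fence_node() is assumed to always succeed and to move the node to
'unknown'. The service loop is a simplified approximation of the
existing behaviour, where fencing is only attempted via services in
'fence' state.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the node status object (hypothetical, not PVE::HA::NodeStatus).
package ToyNodeStatus;

sub new {
    my ($class, %status) = @_;
    return bless { status => { %status }, fence_calls => 0 }, $class;
}

sub get_node_state {
    my ($self, $node) = @_;
    return $self->{status}->{$node};
}

# Assumption for this sketch: fencing always succeeds and the node
# leaves 'fence' state right away.
sub fence_node {
    my ($self, $node) = @_;
    $self->{fence_calls}++;
    $self->{status}->{$node} = 'unknown';
    return 1;
}

package main;

# One simplified manager round. $services maps a service ID to a hash
# with 'node' and 'state' keys.
sub manage_round {
    my ($ns, $services, $with_patch) = @_;
    my $fenced_nodes = {};

    # Simplified existing behaviour: only services in 'fence' state
    # trigger fence_node() for their node.
    for my $sid (sort keys %$services) {
        my $sd = $services->{$sid};
        next if $sd->{state} ne 'fence';
        my $node = $sd->{node};
        next if defined($fenced_nodes->{$node});
        $fenced_nodes->{$node} = $ns->fence_node($node) || 0;
    }

    # Behaviour added by the patch: also handle nodes in 'fence' state
    # that were not covered by any service above.
    if ($with_patch) {
        for my $node (sort keys %{$ns->{status}}) {
            next if $ns->get_node_state($node) ne 'fence';
            next if defined($fenced_nodes->{$node});
            $fenced_nodes->{$node} = $ns->fence_node($node) || 0;
        }
    }

    return $fenced_nodes;
}

# node2 is in 'fence' state, but its services were already removed.
my $old = ToyNodeStatus->new(node1 => 'online', node2 => 'fence');
my $new = ToyNodeStatus->new(node1 => 'online', node2 => 'fence');

manage_round($old, {}, 0);
manage_round($new, {}, 1);

printf "without patch: node2 is '%s'\n", $old->get_node_state('node2');
printf "with patch:    node2 is '%s'\n", $new->get_node_state('node2');
```

In this toy model, the round without the extra loop never calls
fence_node() for node2, so it stays in 'fence', while the round with
the extra loop calls it once.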

 src/PVE/HA/Manager.pm | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 1c66b43..fc445b1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -472,6 +472,14 @@ sub manage {
 	    $repeat = 1; # for faster execution
 	}
 
+	# Fence nodes stuck in 'fence' state with no services left in 'fence' state.
+	for my $node (sort keys $ns->{status}->%*) {
+	    next if $ns->get_node_state($node) ne 'fence';
+	    next if defined($fenced_nodes->{$node});
+
+	    $fenced_nodes->{$node} = $ns->fence_node($node) || 0;
+	}
+
 	last if !$repeat;
     }
 
-- 
2.30.2





