[pve-devel] [RFC ha-manager] manage: handle edge case where a node gets stuck in 'fence' state

Fabian Ebner f.ebner at proxmox.com
Fri Oct 8 14:52:26 CEST 2021

If all services in 'fence' state are gone from a node (e.g. by
removing the services) before fence_node() was successful, a node
would get stuck in the 'fence' state. Avoid this by calling
fence_node() if the node is in 'fence' state, regardless of service
state.

Reported in the community forum:

Signed-off-by: Fabian Ebner <f.ebner at proxmox.com>
---

Not really sure if this is worth it, because it's a hard-to-reach edge
case, but AFAICT there is no good way to get out of being stuck. What
would work is either of:
    * Manually correcting the node state.
    * Adding a service to the stuck node and triggering a fence.

An alternative would be to keep services in 'fence' state in the
manager state, even if they were removed from the config. But the
approach from this patch seemed a bit more robust: for example, it
will fix an already existing stuck state, rather than just avoid
creating one.
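
To illustrate the edge case, here is a minimal, self-contained sketch.
It is a toy model, not the actual PVE::HA code: the %node_state and
%services hashes, the stand-in fence_node() (assumed to always succeed
and to move the node to 'unknown'), and the two helper subs are all
hypothetical names made up for this example.

```perl
use strict;
use warnings;

# Toy model only: these hashes and subs are hypothetical stand-ins,
# not the real PVE::HA::Manager / PVE::HA::NodeStatus code.
my %node_state = (node1 => 'online', node2 => 'fence');

# The only service that lived on node2 was removed from the config
# before fencing succeeded, so no service refers to node2 anymore.
my %services = ('vm:100' => { node => 'node1', state => 'started' });

# Stand-in for fence_node(): assume fencing succeeds and the node
# moves on to 'unknown'.
sub fence_node {
    my ($node) = @_;
    $node_state{$node} = 'unknown';
    return 1;
}

# Behaviour without the patch: fencing is only attempted while
# iterating services that are themselves in 'fence' state.
sub fence_via_services {
    my %fenced;
    for my $sid (sort keys %services) {
        my $sd = $services{$sid};
        next if $sd->{state} ne 'fence';
        $fenced{$sd->{node}} //= fence_node($sd->{node}) || 0;
    }
    return \%fenced;
}

# Behaviour added by the patch: also try to fence every node that is
# in 'fence' state and was not handled via its services.
sub fence_stuck_nodes {
    my ($fenced) = @_;
    for my $node (sort keys %node_state) {
        next if $node_state{$node} ne 'fence';
        next if defined($fenced->{$node});
        $fenced->{$node} = fence_node($node) || 0;
    }
    return $fenced;
}

my $fenced = fence_via_services();
my $state_before = $node_state{node2};    # no service left, nothing fenced
fence_stuck_nodes($fenced);
my $state_after = $node_state{node2};     # node-based pass fenced node2

print "node2: $state_before -> $state_after\n";
```

In this toy model, the service-based pass alone leaves node2 in
'fence' forever, while the additional node-based pass lets it leave
that state.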

 src/PVE/HA/Manager.pm | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 1c66b43..fc445b1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -472,6 +472,14 @@ sub manage {
 	    $repeat = 1; # for faster execution
+	# Avoid that a node without services in 'fence' state gets stuck in 'fence' state.
+	for my $node (sort keys $ns->{status}->%*) {
+	    next if $ns->get_node_state($node) ne 'fence';
+	    next if defined($fenced_nodes->{$node});
+	    $fenced_nodes->{$node} = $ns->fence_node($node) || 0;
+	}
 	last if !$repeat;
