[pve-devel] applied: [RFC ha-manager] manage: handle edge case where a node gets stuck in 'fence' state

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Jan 19 14:36:09 CET 2022

On 08.10.21 14:52, Fabian Ebner wrote:
> If all services in 'fence' state are gone from a node (e.g. by
> removing the services) before fence_node() was successful, a node
> would get stuck in the 'fence' state. Avoid this by calling
> fence_node() if the node is in 'fence' state, regardless of service
> state.
> Reported in the community forum:
> https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/
> Signed-off-by: Fabian Ebner <f.ebner at proxmox.com>
> ---
> Not really sure if this is worth it, because it's a hard-to-reach edge
> case, but AFAICT there is no good way to get out of being stuck. What
> would work is either of:
>     * Manually correcting the node state.
>     * Adding a service to the stuck node and triggering a fence
>       situation.
> An alternative would be to keep services in 'fence' state in the
> manager state, even if they were removed from the config. But the
> approach from this patch seemed a bit more robust: for example, it
> will fix an already existing stuck state, rather than just avoid
> creating one.
>  src/PVE/HA/Manager.pm | 8 ++++++++
>  1 file changed, 8 insertions(+)

applied, thanks!

As also discussed off-list, I noticed a related issue in a derived edge case
that could cause trouble too. I spent some time coming up with two tests
covering the situation you fixed as well as mine, slightly expanding the
capabilities of the test/simulation system.
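
For anyone reading this in the archive, here is a rough toy model of the stuck
state described in the commit message. It is only a sketch in Python, not the
Perl code from src/PVE/HA/Manager.pm. The class ToyManager, its methods, the
fence_ok callback, and the state names other than 'fence' are assumptions made
up for illustration.

```python
# Toy model of the edge case; NOT the real PVE::HA::Manager logic.
# ToyManager, step_old, step_patched and fence_ok are hypothetical names.

class ToyManager:
    def __init__(self, node_state, services, fence_ok):
        self.node_state = dict(node_state)  # node name -> state string
        self.services = dict(services)      # service id -> {'node', 'state'}
        self.fence_ok = fence_ok            # callable(node) -> bool, stands in for real fencing

    def fence_node(self, node):
        # Assumption: a successful fence moves the node out of 'fence'.
        if self.fence_ok(node):
            self.node_state[node] = 'unknown'
            return True
        return False

    def step_old(self):
        # Fencing is only attempted on behalf of services in 'fence' state.
        for sd in self.services.values():
            if sd['state'] == 'fence' and self.fence_node(sd['node']):
                sd['state'] = 'recovery'  # toy follow-up state

    def step_patched(self):
        self.step_old()
        # Idea of the patch: also try to fence every node that is in
        # 'fence' state, regardless of whether any service points at it.
        for node, state in list(self.node_state.items()):
            if state == 'fence':
                self.fence_node(node)


# Node n1 is in 'fence', but its only service was removed before fencing succeeded.
mgr = ToyManager({'n1': 'fence'}, {}, fence_ok=lambda node: True)
mgr.step_old()
stuck_state = mgr.node_state['n1']      # no service left to trigger fence_node()
mgr.step_patched()
recovered_state = mgr.node_state['n1']  # the node-level check gets it unstuck
```

In this model, step_old() never reaches fence_node() once the services are
gone, while step_patched() also recovers a node that was already stuck, which
matches the robustness argument from the commit notes.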

