[PVE-User] critical HA problem on a PVE6 cluster

Thu May 7 18:28:57 CEST 2020

Hi all,

*Cluster info:*

  * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
  * Ceph rbd storage (Nautilus)
  * In production for many years with no major issues
  * No specific network problems at the time the problem occurred
  * Node clocks are in sync (all configured with the same NTP server)

*Symptoms:*

Suddenly, last night (around 7 PM), all nodes of our cluster seem to
have rebooted at the same time for no apparent reason (I mean, we
weren't doing anything on it)!
During the reboot, the services "Corosync Cluster Engine" and "Proxmox VE
replication runner" failed. After the nodes rebooted, we had to start
those services manually.

Once the nodes were back up with all pve services running, some of them
were in HA lrm status: old timestamp - dead? while others were in active
status or in wait_for_agent_lock status.
The nodes switch states regularly, and it loops back and forth as long
as we don't change the configuration.

At the same time, the pve-ha-crm service reported unexpected errors, for
example: "Configuration file 'nodes/inf-proxmox6/qemu-server/501.conf'
does not exist", even though the file does exist, but on another node!
Such a message is probably a consequence of the fencing between nodes
caused by the status changes.

*What we have tried until now to stabilize the situation:*

After several investigations and several operations that failed to
solve anything (in particular a complete upgrade to the latest PVE
version, 6.1-11), we finally removed the HA configuration of all the
VMs.
Since then, the state seems to have stabilized although, obviously, it
is not nominal!

Now all the nodes are in HA lrm status: idle, and sometimes switch to
the old timestamp - dead? state, then come back to idle.
None of them is in the "active" state.
Obviously, the quorum status is "no quorum".
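
To follow these state changes over time, the "lrm" lines printed by
`ha-manager status` could be parsed with something like the minimal
Python sketch below. This is only an illustration under assumptions: the
sample text is not real output from our cluster, the line layout
("lrm <node> (<state>, <timestamp>)") is assumed from what the command
prints on PVE 6, and inf-proxmox7 is a made-up node name.

```python
import re

# Assumed layout of an LRM line in `ha-manager status` output:
#   lrm <node> (<state>, <timestamp>)
LRM_LINE = re.compile(r"^lrm\s+(\S+)\s+\((.*)\)\s*$")

# Illustrative sample only (hypothetical values, not taken from our logs).
SAMPLE = """\
lrm inf-proxmox6 (old timestamp - dead?, Thu May  7 17:52:10 2020)
lrm inf-proxmox7 (idle, Thu May  7 18:00:01 2020)
"""


def parse_lrm_states(status_text):
    """Return a dict mapping node name to its LRM state string."""
    states = {}
    for line in status_text.splitlines():
        match = LRM_LINE.match(line.strip())
        if not match:
            continue  # skip quorum/master/service lines and anything else
        node, detail = match.group(1), match.group(2)
        # The state is everything before the last comma; the rest is
        # the timestamp.
        states[node] = detail.rsplit(",", 1)[0].strip()
    return states


if __name__ == "__main__":
    for node, state in sorted(parse_lrm_states(SAMPLE).items()):
        print(node, "->", state)
```

Running this periodically on the real command output and logging the
result would show which nodes flip between idle and
old timestamp - dead?, and when.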

Note that as soon as we try to re-enable HA on the VMs, the problem
occurs again (nodes reboot!) :(

*Question:*

Have you ever experienced such a problem, or do you know a way to
restore a correct HA configuration in this case?
Note that the nodes are currently on version PVE 6.1-11.

I can post specific logs if useful.

Thanks in advance for your help,
Hervé