[PVE-User] critical HA problem on a PVE6 cluster
Herve Ballans
herve.ballans at ias.u-psud.fr
Mon May 11 10:35:19 CEST 2020
Hi everybody,
I would like to take the opportunity at the beginning of this new week
to raise my issue again.
Does anyone have any idea why such a problem occurred, or is this problem
really something new?
Thanks again,
Hervé
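
For reference, the "no quorum" status mentioned in the message below can be
checked against the usual majority rule. This is only a minimal sketch,
assuming the default corosync setup where each of the 5 nodes has exactly
one vote and no special options (such as a QDevice) are in use; the
function names are hypothetical and not part of any PVE tool.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def is_quorate(votes_present: int, total_votes: int) -> bool:
    """True if the partition holds a strict majority of all votes."""
    return votes_present >= quorum_threshold(total_votes)


if __name__ == "__main__":
    total = 5  # assumption: five nodes with one vote each
    for present in range(total + 1):
        print(present, "nodes visible ->",
              "quorate" if is_quorate(present, total) else "no quorum")
```

Under these assumptions, a node that sees fewer than 3 of the 5 cluster
members is not quorate.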
On 07/05/2020 18:28, Herve Ballans wrote:
> Hi all,
>
> *Cluster info:*
>
> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
> * Ceph rbd storage (Nautilus)
> * In production for many years with no major issues
> * No specific network problems at the time the problem occurred
> * Nodes are set to the same time (configured with the same NTP server)
>
> *Symptoms:*
>
> Suddenly, last night (around 7 PM), all nodes of our cluster seem to
> have rebooted at the same time for no apparent reason (I mean, we
> weren't doing anything on it)!
> During the reboot, the services "Corosync Cluster Engine" and "Proxmox VE
> replication runner" failed. After the nodes rebooted, we had to
> start those services manually.
>
> Once rebooted with all pve services running, some nodes were in HA lrm
> status "old timestamp - dead?", while others were in "active" status or
> in "wait_for_agent_lock" status.
> The nodes switch states regularly, looping back and forth as long as
> we don't change the configuration.
>
> At the same time, the pve-ha-crm service reported unexpected errors,
> for example: "Configuration file
> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even though
> the file exists, but on another node!
> Such a message is probably a consequence of the fencing between nodes
> due to the change of status.
>
> *What we have tried until now to stabilize the situation:*
>
> After several investigations and several operations that failed
> to solve anything (in particular a complete upgrade to the latest PVE
> version, 6.1-11), we finally removed the HA configuration from all the
> VMs.
> Since then, the state seems to have stabilized although, obviously, it
> is not nominal!
>
> Now, all the nodes are in HA lrm status "idle" and sometimes switch to
> the "old timestamp - dead?" state, then come back to the "idle" state.
> None of them are in the "active" state.
> Obviously, the quorum status is "no quorum".
>
> Note that, as soon as we try to re-enable HA on the VMs, the problem
> occurs again (nodes reboot!) :(
>
> *Question:*
>
> Have you ever experienced such a problem, or do you know a way to
> restore a correct HA configuration in this case?
> Note that the nodes are currently running PVE 6.1-11.
>
> I can post specific logs if useful.
>
> Thanks in advance for your help,
> Hervé
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user