[PVE-User] critical HA problem on a PVE6 cluster
Herve Ballans
herve.ballans at ias.u-psud.fr
Mon May 11 17:58:21 CEST 2020
Hi Eneko,
Thanks for your answer. I also suspected a network issue at first, but
the physical network equipment doesn't seem to be showing any specific
problems... Here are more details on the cluster:
2x10Gb + 2x1Gb interfaces:
* one 10Gb interface for the Ceph cluster
* one 10Gb interface for the main cluster network
* the two 1Gb interfaces are used for two other VLANs for the VMs
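
For context on the quorum point in Eneko's message quoted below, here is a
minimal sketch of the majority rule applied to a 5-node cluster. It is an
illustration only, not code taken from corosync or Proxmox: it assumes one
vote per node and a strict-majority threshold, and the function names
votes_needed and is_quorate are hypothetical.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest vote count that is a strict majority of total_votes.

    Assumption: every node carries exactly one vote.
    """
    return total_votes // 2 + 1


def is_quorate(total_votes: int, votes_present: int) -> bool:
    """True if the partition holding votes_present votes has a majority."""
    return votes_present >= votes_needed(total_votes)


if __name__ == "__main__":
    total = 5  # number of nodes in the cluster described in this thread
    print("votes needed for quorum:", votes_needed(total))
    # Show which partition sizes would still hold a majority.
    for present in range(total, 0, -1):
        print(present, "node(s) visible ->", is_quorate(total, present))
```

Under these assumptions, a node that can only see one other node (2 votes
out of 5) is not quorate, which is why the stability of the cluster network
matters before HA is enabled again.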
On 11/05/2020 10:39, Eneko Lacunza wrote:
> Hi Hervé,
>
> This looks like a network issue. What is the network setup in this cluster?
> What does syslog show for corosync and pve-cluster?
>
> Don't enable HA until you have a stable cluster quorum.
>
> Cheers
> Eneko
>
> El 11/5/20 a las 10:35, Herve Ballans escribió:
>> Hi everybody,
>>
>> I would like to take the opportunity at the beginning of this new
>> week to raise my issue again.
>>
>> Does anyone have any idea why such a problem occurred, or is this
>> problem really something new?
>>
>> Thanks again,
>> Hervé
>>
>> On 07/05/2020 18:28, Herve Ballans wrote:
>>> Hi all,
>>>
>>> *Cluster info:*
>>>
>>> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>>> * Ceph RBD storage (Nautilus)
>>> * In production for many years with no major issues
>>> * No specific network problems at the time the problem occurred
>>> * Nodes have synchronized clocks (configured with the same NTP server)
>>>
>>> *Symptoms:*
>>>
>>> Suddenly, last night (around 7 PM), all nodes of our cluster seem
>>> to have rebooted at the same time for no apparent reason (I mean,
>>> we weren't doing anything on it)!
>>> During the reboot, the services "Corosync Cluster Engine" and "Proxmox
>>> VE replication runner" failed. After the nodes rebooted, we had to
>>> start those services manually.
>>>
>>> Once rebooted with all pve services running, some nodes were in HA lrm
>>> status "old timestamp - dead?" while others were in "active" status or
>>> in "wait_for_agent_lock" status...
>>> Nodes switch states regularly, and it loops back and forth as long
>>> as we don't change the configuration...
>>>
>>> At the same time, the pve-ha-crm service got unexpected errors, for
>>> example: "Configuration file
>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even
>>> though the file exists, but on another node!
>>> Such a message is probably a consequence of the fencing between nodes
>>> due to the change of status...
>>>
>>> *What we have tried until now to stabilize the situation:*
>>>
>>> After several investigations and several operations that failed
>>> to solve anything (in particular a complete upgrade to the latest
>>> PVE version 6.1-11),
>>>
>>> we finally removed the HA configuration from all the VMs.
>>> Since then, the state seems to have stabilized although, obviously,
>>> it is not nominal!
>>>
>>> Now, all the nodes are in HA lrm status "idle" and sometimes switch
>>> to the "old timestamp - dead?" state, then come back to the "idle" state.
>>> None of them are in the "active" state.
>>> Obviously, quorum status is "no quorum"
>>>
>>> Note that, as soon as we try to re-activate HA
>>> on the VMs, the problem occurs again (nodes reboot!) :(
>>>
>>> *Question:*
>>>
>>> Have you ever experienced such a problem, or do you know a way to
>>> restore a correct HA configuration in this case?
>>> Note that the nodes are currently on version PVE 6.1-11.
>>>
>>> I can post some specific logs if useful.
>>>
>>> Thanks in advance for your help,
>>> Hervé
>>>
>>> _______________________________________________
>>> pve-user mailing list
>>> pve-user at pve.proxmox.com
>>> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>