[PVE-User] critical HA problem on a PVE6 cluster

Herve Ballans herve.ballans at ias.u-psud.fr
Thu May 14 16:48:45 CEST 2020


Hi Eneko,

Thanks again for trying to help me.

Now, the problem is solved! We upgraded our entire cluster to PVE 6.2 
and now everything is optimal, including the HA status.
We just upgraded each node and didn't change anything else (I mean in 
terms of configuration files).

Here I'm just stating a fact; I'm not saying that it is the upgrade 
process itself that solved our problems...

Indeed, we are investigating with the network engineers who manage the 
network equipment of our datacenter in order to see if something was 
happening at the moment our cluster crashed.

I will let you know if I have the answer to that mystery...

Cheers,
Hervé

On 12/05/2020 15:00, Eneko Lacunza wrote:
> Hi Hervé,
>
> On 11/05/2020 17:58, Herve Ballans wrote:
>> Thanks for your answer. I was also thinking of a network issue at 
>> first, but the physical network equipment doesn't seem to be showing 
>> any specific problems... Here are more details on the cluster:
>>
>> 2x10Gb + 2x1Gb interfaces:
>>
>>  * a 10Gb interface for the Ceph cluster
>>  * a 10Gb interface for the main cluster network
>>  * the other two 1Gb interfaces are used for two other VLANs for the VMs
>
> Can you post:
>
>  * "pvecm status" (to see the cluster network IPs)
>  * "ip a" for a node
>  * "cat /etc/corosync/corosync.conf"
>
> Do all network interfaces go to the same switch?
>
> PVE 6.2 has been released and it supports multiple networks for the 
> cluster. I suggest you look at it and configure a second cluster 
> network that uses another switch.
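>
> As a minimal sketch (node names and addresses below are just 
> placeholders, with the second subnet living on the other switch), a 
> two-link corosync.conf on PVE 6.2 / corosync 3 would look roughly like:
>
>   totem {
>     version: 2
>     cluster_name: mycluster
>     interface {
>       linknumber: 0
>     }
>     interface {
>       linknumber: 1
>     }
>   }
>
>   nodelist {
>     node {
>       name: node1
>       nodeid: 1
>       quorum_votes: 1
>       ring0_addr: 10.10.10.1
>       ring1_addr: 192.168.50.1
>     }
>     # one entry per node, each with ring0_addr and ring1_addr
>   }
>
> With two links, kronosnet can keep cluster communication up if one of 
> the networks fails.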
>
> In the logs you sent, I can see that there are serious cluster 
> problems; at 18:38:58 I can see only nodes 1, 3, 4 and 5 in quorum.
>
> Also, at 18:39:01 I can see ceph-mon complaining about slow ops and a 
> failed timeout for osd.5.
>
> I really think there is a network issue. Ceph and Proxmox clusters are 
> completely separate, but they're both having issues.
>
> I'd try changing the network switch; I'd even try a 1G switch, just to 
> see if that makes the Proxmox cluster and Ceph stable. Are the 10G 
> interfaces heavily loaded?
>
> Cheers
> Eneko
>
>>
>>
>>
>> On 11/05/2020 10:39, Eneko Lacunza wrote:
>>> Hi Hervé,
>>>
>>> This looks like a network issue. What is the network setup in this 
>>> cluster? What does syslog show about corosync and pve-cluster?
>>>
>>> Don't enable HA until you have a stable cluster quorum.
>>>
>>> Cheers
>>> Eneko
>>>
>>> On 11/05/2020 10:35, Herve Ballans wrote:
>>>> Hi everybody,
>>>>
>>>> I would like to take the opportunity at the beginning of this new 
>>>> week to ask about my issue again.
>>>>
>>>> Does anyone have any idea why such a problem occurred, or is this 
>>>> problem really something new?
>>>>
>>>> Thanks again,
>>>> Hervé
>>>>
>>>> On 07/05/2020 18:28, Herve Ballans wrote:
>>>>> Hi all,
>>>>>
>>>>> *Cluster info:*
>>>>>
>>>>>  * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>>>>>  * Ceph RBD storage (Nautilus)
>>>>>  * In production for many years with no major issues
>>>>>  * No specific network problems at the time the problem occurred
>>>>>  * Nodes are on the same date (configured with the same NTP server)
>>>>>
>>>>> *Symptoms:*
>>>>>
>>>>> Suddenly, last night (around 7 PM), all the nodes of our cluster 
>>>>> seem to have rebooted at the same time with no apparent reason (I 
>>>>> mean, we weren't doing anything on it)!
>>>>> During the reboot, the "Corosync Cluster Engine" and "Proxmox VE 
>>>>> replication runner" services failed. After the nodes rebooted, we 
>>>>> had to start those services manually.
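>>>>> (To check and start them by hand we used something along these 
>>>>> lines; the unit names below are the standard ones for those 
>>>>> services on PVE 6:
>>>>>
>>>>>   systemctl status corosync pvesr.timer
>>>>>   systemctl start corosync
>>>>>   systemctl start pvesr.timer
>>>>> )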
>>>>>
>>>>> Once rebooted with all PVE services, some nodes were in HA lrm 
>>>>> status "old timestamp - dead?" while others were in "active" or 
>>>>> "wait_for_agent_lock" status...
>>>>> Nodes switch states regularly... and it loops back and forth as 
>>>>> long as we don't change the configuration...
>>>>>
>>>>> At the same time, the pve-ha-crm service got unexpected errors, 
>>>>> for example: "Configuration file 
>>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even 
>>>>> though the file does exist, but on another node!
>>>>> Such a message is probably a consequence of the fencing between 
>>>>> nodes due to the changes of status...
>>>>>
>>>>> *What we have tried so far to stabilize the situation:*
>>>>>
>>>>> After several investigations and several operations that failed 
>>>>> to solve anything (in particular a complete upgrade to the latest 
>>>>> PVE version, 6.1-11), we finally removed the HA configuration from 
>>>>> all the VMs.
>>>>> Since then, the state seems to have stabilized although, obviously, 
>>>>> it is not nominal!
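>>>>> (Concretely, that meant running, for each HA-managed VM, something 
>>>>> like the following, using VM 501 from the error above as an example:
>>>>>
>>>>>   ha-manager remove vm:501
>>>>>   ha-manager status
>>>>> )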
>>>>>
>>>>> Now, all the nodes are in HA lrm status "idle" and sometimes 
>>>>> switch to the "old timestamp - dead?" state, then come back to the 
>>>>> "idle" state.
>>>>> None of them are in the "active" state.
>>>>> Obviously, the quorum status is "no quorum".
>>>>>
>>>>> It should be noted that, as soon as we try to re-activate HA on 
>>>>> the VMs, the problem occurs again (the nodes reboot!) :(
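>>>>> (By "re-activate" I mean adding the resources back, roughly with 
>>>>> "ha-manager add vm:501 --state started", again with vm:501 as an 
>>>>> example VMID.)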
>>>>>
>>>>> *Question:*
>>>>>
>>>>> Have you ever experienced such a problem, or do you know a way to 
>>>>> restore a correct HA configuration in this case?
>>>>> Note that the nodes are currently on PVE version 6.1-11.
>>>>>
>>>>> I can post some specific logs if that would be useful.
>>>>>
>>>>> Thanks in advance for your help,
>>>>> Hervé
>>>>>
>>>
>>>
>
>


