[PVE-User] critical HA problem on a PVE6 cluster

Eneko Lacunza elacunza at binovo.es
Tue May 12 15:00:01 CEST 2020


Hi Hervé,

El 11/5/20 a las 17:58, Herve Ballans escribió:
> Thanks for your answer. At first I was also thinking of a network issue, 
> but the physical network equipment doesn't seem to be showing any 
> specific problems... Here are more details on the cluster:
>
> 2x10Gb + 2x1Gb interfaces:
>
>  * a 10Gb interface for the Ceph cluster
>  * a 10Gb interface for the main cluster network
>  * the other two 1Gb interfaces are used for two other VLANs for the VMs

Can you post:
 * "pvecm status", to see the cluster network IPs
 * "ip a" for one node
 * "cat /etc/corosync/corosync.conf"

Do all network interfaces go to the same switch?

PVE 6.2 has been released and it supports multiple links for the cluster 
network (corosync). I suggest you look at it and configure a second link 
that uses another switch.
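
As a rough sketch only (node name, addresses and config_version below are 
made up, not taken from your setup), the relevant parts of corosync.conf 
with a second link could look like this:

    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.10.1      # existing cluster network
        ring1_addr: 192.168.50.1    # new link on a separate switch
      }
      # ... the same pair of addresses for every other node ...
    }

    totem {
      version: 2
      cluster_name: yourcluster
      config_version: 6             # must be increased on every edit
      interface {
        linknumber: 0
      }
      interface {
        linknumber: 1
      }
      # (keep the other existing totem options as they are)
    }

On PVE you edit /etc/pve/corosync.conf; with a higher config_version the 
change should propagate to all nodes automatically.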

In the logs you sent, I can see that there are serious cluster problems: 
at 18:38:58 only nodes 1, 3, 4 and 5 are in the quorum.

Also, at 18:39:01 I can see ceph-mon complaining about slow ops and a 
failed timeout for osd.5.
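
If you want to dig into the Ceph side, these read-only commands might help 
to see whether osd.5 itself or its network link is struggling (the last 
one has to run on the node hosting osd.5):

    ceph -s                               # overall health, slow ops summary
    ceph health detail                    # which OSDs/mons report problems
    ceph osd perf                         # per-OSD commit/apply latencies
    ceph daemon osd.5 dump_ops_in_flight  # ops currently stuck on osd.5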

I really think there is a network issue. The Ceph and Proxmox clusters 
are completely separate, yet both are having issues.

I'd try changing the network switch; I'd even try a 1G switch, just to 
see if that makes the Proxmox cluster and Ceph stable. Are the 10G 
interfaces heavily loaded?
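
To get an idea of load and error counters on the 10G links, something like 
this could help ("eno1" is just a placeholder for your interface name; sar 
comes with the sysstat package):

    ip -s link show dev eno1                      # byte, error and drop counters
    ethtool -S eno1 | grep -Ei 'err|drop|pause'   # NIC-level statistics
    sar -n DEV 1 5                                # live per-interface throughput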

Cheers
Eneko

>
>
>
> On 11/05/2020 10:39, Eneko Lacunza wrote:
>> Hi Hervé,
>>
>> This seems like a network issue. What is the network setup in this 
>> cluster? What does syslog show about corosync and pve-cluster?
>>
>> Don't enable HA until you have a stable cluster quorum.
>>
>> Cheers
>> Eneko
>>
>> El 11/5/20 a las 10:35, Herve Ballans escribió:
>>> Hi everybody,
>>>
>>> I would like to take the opportunity, at the beginning of this new 
>>> week, to ask about my issue again.
>>>
>>> Does anyone have any idea why such a problem occurred, or is this 
>>> problem really something new?
>>>
>>> Thanks again,
>>> Hervé
>>>
>>> On 07/05/2020 18:28, Herve Ballans wrote:
>>>> Hi all,
>>>>
>>>> *Cluster info:*
>>>>
>>>>  * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>>>>  * Ceph RBD storage (Nautilus)
>>>>  * In production for many years with no major issues
>>>>  * No specific network problems at the time the problem occurred
>>>>  * Nodes have the same date/time (configured with the same NTP server)
>>>>
>>>> *Symptoms:*
>>>>
>>>> Suddenly, last night (around 7 PM), all nodes of our cluster seem 
>>>> to have rebooted at the same time with no apparent reason (I mean, 
>>>> we weren't doing anything on them)!
>>>> During the reboot, the "Corosync Cluster Engine" and "Proxmox VE 
>>>> replication runner" services failed. After the nodes rebooted, we 
>>>> had to start those services manually.
>>>>
>>>> Once rebooted with all PVE services running, some nodes were in HA 
>>>> LRM status "old timestamp - dead?", while others were in "active" 
>>>> or "wait_for_agent_lock" status...
>>>> Nodes switch states regularly... and it loops back and forth as 
>>>> long as we don't change the configuration...
>>>>
>>>> At the same time, the pve-ha-crm service got unexpected errors, for 
>>>> example: "Configuration file 
>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even 
>>>> though the file exists, but on another node!
>>>> Such a message is probably a consequence of the fencing between 
>>>> nodes due to the change of status...
>>>>
>>>> *What we have tried until now to stabilize the situation:*
>>>>
>>>> After several investigations and several operations that failed to 
>>>> solve anything (in particular a complete upgrade to the latest PVE 
>>>> version 6.1-11),
>>>>
>>>> we finally removed the HA configuration from all the VMs.
>>>> Since then, the state seems to have stabilized, although obviously 
>>>> it is not nominal!
>>>>
>>>> Now all the nodes are in HA LRM status "idle"; they sometimes 
>>>> switch to the "old timestamp - dead?" state, then come back to the 
>>>> "idle" state.
>>>> None of them are in the "active" state.
>>>> Obviously, the quorum status is "no quorum".
>>>>
>>>> Note that as soon as we try to re-activate HA on the VMs, the 
>>>> problem occurs again (nodes reboot!) :(
>>>>
>>>> *Question:*
>>>>
>>>> Have you ever experienced such a problem, or do you know a way to 
>>>> restore a correct HA configuration in this case?
>>>> Note that the nodes are currently on PVE version 6.1-11.
>>>>
>>>> I can post some specific logs if useful.
>>>>
>>>> Thanks in advance for your help,
>>>> Hervé
>>>>
>>
>>


-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



