[PVE-User] critical HA problem on a PVE6 cluster
Eneko Lacunza
elacunza at binovo.es
Thu May 14 17:17:40 CEST 2020
Hi Hervé,
Glad to read this :)
Cheers
On 14/5/20 at 16:48, Herve Ballans wrote:
> Hi Eneko,
>
> Thanks again for trying to help me.
>
> Now, the problem is solved! We upgraded our entire cluster to PVE 6.2
> and now everything is optimal, including the HA status.
> We just upgraded each node; we didn't change anything else (I mean in
> terms of configuration files).
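>
> For the record, on each node this was essentially the standard minor
> upgrade procedure (assuming the package repositories were already
> configured):
>
>     apt update
>     apt full-upgrade    # 6.1 -> 6.2, then reboot the node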
>
> Here, I'm just stating a fact; I'm not saying that it was the upgrade
> process itself that solved our problems...
>
> Indeed, we are investigating with the network engineers who manage the
> network equipment of our datacenter, in order to see whether something
> was happening at the moment our cluster crashed.
>
> I will let you know if I have the answer to that mystery...
>
> Cheers,
> Hervé
>
> On 12/05/2020 15:00, Eneko Lacunza wrote:
>> Hi Hervé,
>>
>> On 11/5/20 at 17:58, Herve Ballans wrote:
>>> Thanks for your answer. At first I was also thinking of a network
>>> issue, but the physical network equipment doesn't seem to show any
>>> specific problems... Here are more details on the cluster:
>>>
>>> 2x10Gb + 2x1Gb interfaces (see the sketch just below):
>>>
>>> * a 10Gb interface for the Ceph cluster
>>> * a 10Gb interface for the main cluster network
>>> * the other two 1Gb interfaces are used for two other VLANs for the VMs
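>>>
>>> Roughly, the relevant part of /etc/network/interfaces looks like this
>>> (interface names, addresses and bridge layout below are illustrative,
>>> not our real values):
>>>
>>>     auto ens1f0
>>>     iface ens1f0 inet static
>>>         address 10.10.10.1/24      # Ceph cluster network (10Gb)
>>>
>>>     auto ens1f1
>>>     iface ens1f1 inet static
>>>         address 10.10.20.1/24      # main cluster network (10Gb)
>>>
>>>     auto vmbr0
>>>     iface vmbr0 inet manual
>>>         bridge-ports eno1          # 1Gb, first VM VLAN
>>>         bridge-stp off
>>>         bridge-fd 0
>>>
>>>     auto vmbr1
>>>     iface vmbr1 inet manual
>>>         bridge-ports eno2          # 1Gb, second VM VLAN
>>>         bridge-stp off
>>>         bridge-fd 0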
>>
>> Can you post:
>> "pvecm status" to see the cluster network IPs?
>> "ip a" for a node?
>> "cat /etc/corosync/corosync.conf"?
>>
>> Do all network interfaces go to the same switch?
>>
>> PVE 6.2 has been released, and it supports multiple networks (links)
>> for the cluster. I suggest you look into it and configure a second
>> cluster network that uses another switch; a rough sketch follows below.
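>>
>> As a rough sketch (node names and addresses here are made up), each
>> node entry in /etc/pve/corosync.conf gains a second ring address, and
>> config_version in the totem section must be bumped:
>>
>>     node {
>>       name: pve1
>>       nodeid: 1
>>       quorum_votes: 1
>>       ring0_addr: 10.10.20.1    # existing cluster network
>>       ring1_addr: 192.168.0.1   # new link on a second switch
>>     }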
>>
>> In the logs you sent, I can see that there are serious cluster
>> problems; at 18:38:58 I can see only nodes 1, 3, 4 and 5 in the quorum.
>>
>> Also, at 18:39:01 I can see ceph-mon complaining about slow ops and a
>> failed timeout for osd.5.
>>
>> I really think there is a network issue. Ceph and Proxmox clusters
>> are completely separate, but they're both having issues.
>>
>> I'd try changing the network switch; I'd even try a 1G switch, just to
>> see whether that makes the Proxmox cluster and Ceph stable. Are the 10G
>> interfaces heavily loaded?
>>
>> Cheers
>> Eneko
>>
>>>
>>> On 11/05/2020 10:39, Eneko Lacunza wrote:
>>>> Hi Hervé,
>>>>
>>>> This seems like a network issue. What is the network setup in this
>>>> cluster? What do the syslog entries for corosync and pve-cluster say?
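>>>>
>>>> For example, something like this should show the relevant entries
>>>> (adjust the time window to when the problem occurred):
>>>>
>>>>     journalctl -u corosync -u pve-cluster --since "2020-05-06" --until "2020-05-08"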
>>>>
>>>> Don't enable HA until you have a stable cluster quorum.
>>>>
>>>> Cheers
>>>> Eneko
>>>>
>>>> On 11/5/20 at 10:35, Herve Ballans wrote:
>>>>> Hi everybody,
>>>>>
>>>>> I would like to take the opportunity, at the beginning of this new
>>>>> week, to raise my issue again.
>>>>>
>>>>> Does anyone have any idea why such a problem occurred, or is this
>>>>> problem really something new?
>>>>>
>>>>> Thanks again,
>>>>> Hervé
>>>>>
>>>>> On 07/05/2020 18:28, Herve Ballans wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> *Cluster info:*
>>>>>>
>>>>>> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>>>>>> * Ceph RBD storage (Nautilus)
>>>>>> * In production for many years with no major issues
>>>>>> * No specific network problems at the time the problem occurred
>>>>>> * Nodes have the same time (configured with the same NTP server)
>>>>>>
>>>>>> *Symptoms:*
>>>>>>
>>>>>> Suddenly, last night (around 7 PM), all the nodes of our cluster
>>>>>> seem to have rebooted at the same time for no apparent reason (I
>>>>>> mean, we weren't doing anything on the cluster)!
>>>>>> During the reboot, the "Corosync Cluster Engine" and "Proxmox VE
>>>>>> replication runner" services failed. After each node rebooted, we
>>>>>> had to start those services manually.
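>>>>>> (That is, assuming the standard unit names, something like:
>>>>>>
>>>>>>     systemctl start corosync.service pvesr.service
>>>>>>
>>>>>> on each affected node.)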
>>>>>>
>>>>>> Once rebooted with all PVE services running, some nodes were in HA
>>>>>> LRM status "old timestamp - dead?" while others were in "active" or
>>>>>> "wait_for_agent_lock" status...
>>>>>> Nodes switched states regularly... and it looped back and forth as
>>>>>> long as we didn't change the configuration...
>>>>>>
>>>>>> At the same time, the pve-ha-crm service got unexpected errors,
>>>>>> for example: "Configuration file
>>>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even
>>>>>> though the file does exist, but on another node!
>>>>>> Such a message is probably a consequence of the fencing between
>>>>>> nodes due to the status changes...
>>>>>>
>>>>>> *What we have tried until now to stabilize the situation:*
>>>>>>
>>>>>> After several investigations and several operations that failed to
>>>>>> solve anything (in particular, a complete upgrade to the latest PVE
>>>>>> version, 6.1-11), we finally removed the HA configuration from all
>>>>>> the VMs.
>>>>>> Since then, the state seems to have stabilized although, obviously,
>>>>>> it is not nominal!
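>>>>>> (Removing the HA configuration meant running, for each VM,
>>>>>> something like:
>>>>>>
>>>>>>     ha-manager remove vm:501
>>>>>>
>>>>>> where 501 is just an example VM ID.)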
>>>>>>
>>>>>> Now, all the nodes are in HA LRM status "idle" and sometimes
>>>>>> switch to the "old timestamp - dead?" state, then come back to the
>>>>>> "idle" state.
>>>>>> None of them is in the "active" state.
>>>>>> Obviously, the quorum status is "no quorum".
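>>>>>> (We observe these states with the standard tools:
>>>>>>
>>>>>>     ha-manager status
>>>>>>     pvecm status
>>>>>>
>>>>>> the former shows the LRM/CRM states, the latter the quorum.)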
>>>>>>
>>>>>> Note that, as soon as we try to re-activate HA on the VMs, the
>>>>>> problem occurs again (the nodes reboot!) :(
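>>>>>> (Re-activating here means adding the resources back, e.g.:
>>>>>>
>>>>>>     ha-manager add vm:501 --state started
>>>>>>
>>>>>> again with an example VM ID.)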
>>>>>>
>>>>>> *Question:*
>>>>>>
>>>>>> Have you ever experienced such a problem, or do you know a way to
>>>>>> restore a correct HA configuration in this case?
>>>>>> Note that the nodes are currently on version PVE 6.1-11.
>>>>>>
>>>>>> I can post some specific logs if useful.
>>>>>>
>>>>>> Thanks in advance for your help,
>>>>>> Hervé
>>>>>>
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es