[PVE-User] critical HA problem on a PVE6 cluster
Herve Ballans
herve.ballans at ias.u-psud.fr
Mon May 11 19:13:15 CEST 2020
Hi again (sorry for the spam!).
I have just found the logs from just before the crash of one of the nodes
(time of crash: 18:36:36). They may be more useful than the logs I sent
previously... (I have removed the normal events here.)
First, several messages like this (the first one at 11:00 am):
May 6 18:33:25 inf-proxmox7 corosync[2648]: [TOTEM ] Token has not
been received in 2212 ms
May 6 18:33:26 inf-proxmox7 corosync[2648]: [TOTEM ] A processor
failed, forming new configuration.
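(As a side note: when I see these repeated token timeouts, I suppose the
first thing to capture next time is how corosync itself sees the links and
the quorum, roughly with the commands below; I unfortunately don't have that
output from the moment of the crash.)

  corosync-cfgtool -s   # link/ring status as seen by corosync on this node
  pvecm status          # quorum, membership and votes from the PVE side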
Then:
May 6 18:34:14 inf-proxmox7 corosync[2648]: [MAIN ] Completed
service synchronization, ready to provide service.
May 6 18:34:14 inf-proxmox7 pvesr[3342642]: error with cfs lock
'file-replication_cfg': got lock request timeout
May 6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Main process
exited, code=exited, status=17/n/a
May 6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Failed with
result 'exit-code'.
May 6 18:34:14 inf-proxmox7 systemd[1]: Failed to start Proxmox VE
replication runner.
May 6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice:
cpg_send_message retry 30
May 6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice:
cpg_send_message retried 30 times
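(The 'cfs lock' timeout and the cpg_send_message retries make me think pmxcfs
itself was struggling at that point. If it helps, I can extract the full
corosync and pve-cluster journal around that window, something like this; the
time range is simply the window shown above:)

  journalctl -u corosync -u pve-cluster --since "2020-05-06 18:30" --until "2020-05-06 18:40"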
Then again a series of "processor failed" messages (147 in total before
the crash):
May 6 18:35:03 inf-proxmox7 corosync[2648]: [TOTEM ] Token has not
been received in 2212 ms
May 6 18:35:04 inf-proxmox7 corosync[2648]: [TOTEM ] A processor
failed, forming new configuration.
Then:
May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] notice: start cluster
connection
May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_join failed: 14
May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: can't initialize
service
May 6 18:35:40 inf-proxmox7 pve-ha-lrm[5528]: lost lock
'ha_agent_inf-proxmox7_lock - cfs lock update failed - Device or
resource busy
May 6 18:35:40 inf-proxmox7 pve-ha-crm[5421]: status change slave =>
wait_for_quorum
May 6 18:35:41 inf-proxmox7 corosync[2648]: [TOTEM ] A new membership
(1.e60) was formed. Members joined: 1 3 4 5
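(Given the "lost lock 'ha_agent_inf-proxmox7_lock'" line above, I suppose I
should also capture the HA manager's own view the next time this happens,
for example:)

  ha-manager status                        # CRM master, per-node LRM state, resource states
  systemctl status pve-ha-crm pve-ha-lrm   # state of the HA services themselves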
Then:
May 6 18:35:41 inf-proxmox7 pmxcfs[2602]: [status] notice: node has quorum
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice:
cpg_send_message retried 1 times
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received
sync request (epoch 1/2592/00000031)
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received
sync request (epoch 1/2592/00000032)
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message
failed: 9
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message
failed: 9
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received all
states
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: all data is
up to date
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice:
dfsm_deliver_queue: queue length 144
Then:
May 6 18:35:57 inf-proxmox7 corosync[2648]: [TOTEM ] A new membership
(1.e64) was formed. Members left: 3 4
May 6 18:35:57 inf-proxmox7 corosync[2648]: [TOTEM ] Failed to
receive the leave message. failed: 3 4
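(Members 3 and 4 dropping out like this makes me suspect the corosync network
itself, so next time I will also look for link or NIC events in the kernel
log around that window; the grep pattern below is only an example and would
need to be adjusted to our interface names:)

  journalctl -k --since "2020-05-06 18:30" --until "2020-05-06 18:40" | grep -iE 'link|bond|nic'
  ip -s link   # per-interface error and drop counters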
And finally the crash, after these last log lines:
May 6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: status change
wait_for_quorum => slave
May 6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Main process
exited, code=exited, status=17/n/a
May 6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Failed with
result 'exit-code'.
May 6 18:36:36 inf-proxmox7 systemd[1]: Failed to start Proxmox VE
replication runner.
May 6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: loop take too long (51
seconds)
May 6 18:36:36 inf-proxmox7 systemd[1]: watchdog-mux.service: Succeeded.
May 6 18:36:36 inf-proxmox7 kernel: [1292969.953131] watchdog:
watchdog0: watchdog did not stop!
May 6 18:36:36 inf-proxmox7 pvestatd[2894]: status update time (5.201
seconds)
^@^@^@^@^@^@
followed by a binary part...
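(The very last lines, with watchdog-mux stopping and "watchdog did not
stop!", make me think the "crash" was in fact the watchdog resetting the node
once the HA services lost their lock, rather than a kernel panic. To check
the watchdog side after the next occurrence, I suppose something like this
would be enough; softdog is only my assumption for a default setup:)

  systemctl status watchdog-mux       # PVE's watchdog multiplexer
  lsmod | grep -i dog                 # which watchdog module is loaded (softdog by default, I believe)
  journalctl -u watchdog-mux -u pve-ha-lrm --since "2020-05-06 18:30"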
Thank you again,
Hervé
On 11/05/2020 10:39, Eneko Lacunza wrote:
>>> Hi Hervé,
>>>
>>> This seems to be a network issue. What is the network setup in this
>>> cluster? What do the syslog entries about corosync and pve-cluster show?
>>>
>>> Don't enable HA until you have a stable cluster quorum.
>>>
>>> Cheers
>>> Eneko
>>>
>>>> On 11/5/20 at 10:35, Herve Ballans wrote:
>>>> Hi everybody,
>>>>
>>>> I would like to take the opportunity, at the start of this new week,
>>>> to raise my issue again.
>>>>
>>>> Does anyone have any idea why such a problem occurred, or is this
>>>> problem really something new?
>>>>
>>>> Thanks again,
>>>> Hervé
>>>>
>>>> On 07/05/2020 18:28, Herve Ballans wrote:
>>>>> Hi all,
>>>>>
>>>>> *Cluster info:*
>>>>>
>>>>> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>>>>> * Ceph RBD storage (Nautilus)
>>>>> * In production for many years with no major issues
>>>>> * No specific network problems at the time the problem occurred
>>>>> * Nodes have the same date/time (configured with the same NTP server)
>>>>>
>>>>> *Symptoms:*
>>>>>
>>>>> Suddenly, last night (around 7 PM), all the nodes of our cluster seem
>>>>> to have rebooted at the same time for no apparent reason (I mean, we
>>>>> weren't doing anything on them)!
>>>>> During the reboot, the services "Corosync Cluster Engine" and "Proxmox
>>>>> VE replication runner" failed. After each node rebooted, we had to
>>>>> start those services manually.
>>>>>
>>>>> Once rebooted with all PVE services running, some nodes were in HA LRM
>>>>> status "old timestamp - dead?" while others were in "active" or
>>>>> "wait_for_agent_lock" status...
>>>>> The nodes switch states regularly... and it loops back and forth as
>>>>> long as we don't change the configuration...
>>>>>
>>>>> At the same time, the pve-ha-crm service reported unexpected errors,
>>>>> for example: "Configuration file
>>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even
>>>>> though the file does exist, but on another node!
>>>>> Such a message is probably a consequence of the fencing between
>>>>> nodes due to the changes of status...
>>>>>
>>>>> *What we have tried so far to stabilize the situation:*
>>>>>
>>>>> After several investigations and several operations that failed to
>>>>> solve anything (in particular a complete upgrade to the latest PVE
>>>>> version, 6.1-11),
>>>>>
>>>>> we finally removed the HA configuration from all the VMs.
>>>>> Since then, the situation seems to have stabilized although,
>>>>> obviously, it is not nominal!
>>>>>
>>>>> Now, all the nodes are in HA LRM status "idle"; they sometimes switch
>>>>> to the "old timestamp - dead?" state, then come back to "idle".
>>>>> None of them are in the "active" state.
>>>>> Obviously, the quorum status is "no quorum".
>>>>>
>>>>> Note that as soon as we try to re-enable HA on the VMs, the problem
>>>>> occurs again (the nodes reboot!) :(
>>>>>
>>>>> *Question:*
>>>>>
>>>>> Have you ever experienced such a problem, or do you know a way to
>>>>> restore a correct HA configuration in this case?
>>>>> Note that the nodes are currently on PVE version 6.1-11.
>>>>>
>>>>> I can post specific logs if that would be useful.
>>>>>
>>>>> Thanks in advance for your help,
>>>>> Hervé