[PVE-User] critical HA problem on a PVE6 cluster
Herve Ballans
herve.ballans at ias.u-psud.fr
Mon May 11 19:13:15 CEST 2020
Hi again (sorry for the spam!).
I have just found the logs from just before the crash of one of the nodes
(time of crash: 18:36:36). They may be more useful than the logs I sent
previously... (I have removed the normal events here.)
First, several messages like this (the first one at 11:00 am):
May 6 18:33:25 inf-proxmox7 corosync[2648]: [TOTEM ] Token has not
been received in 2212 ms
May 6 18:33:26 inf-proxmox7 corosync[2648]: [TOTEM ] A processor
failed, forming new configuration.
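(As a side note: when I see these repeated token timeouts, I suppose the
first thing to capture next time is how corosync itself sees the links and
the quorum, roughly with the commands below; I unfortunately don't have that
output from the moment of the crash.)

  corosync-cfgtool -s   # link/ring status as seen by corosync on this node
  pvecm status          # quorum, membership and votes from the PVE side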
Then:
May 6 18:34:14 inf-proxmox7 corosync[2648]: [MAIN ] Completed
service synchronization, ready to provide service.
May 6 18:34:14 inf-proxmox7 pvesr[3342642]: error with cfs lock
'file-replication_cfg': got lock request timeout
May 6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Main process
exited, code=exited, status=17/n/a
May 6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Failed with
result 'exit-code'.
May 6 18:34:14 inf-proxmox7 systemd[1]: Failed to start Proxmox VE
replication runner.
May 6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice:
cpg_send_message retry 30
May 6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice:
cpg_send_message retried 30 times
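(The 'cfs lock' timeout and the cpg_send_message retries make me think pmxcfs
itself was struggling at that point. If it helps, I can extract the full
corosync and pve-cluster journal around that window, something like this; the
time range is simply the window shown above:)

  journalctl -u corosync -u pve-cluster --since "2020-05-06 18:30" --until "2020-05-06 18:40"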
Then again a series of "processor failed" messages (147 in total before
the crash):
May 6 18:35:03 inf-proxmox7 corosync[2648]: [TOTEM ] Token has not
been received in 2212 ms
May 6 18:35:04 inf-proxmox7 corosync[2648]: [TOTEM ] A processor
failed, forming new configuration.
Then:
May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] notice: start cluster
connection
May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_join failed: 14
May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: can't initialize
service
May 6 18:35:40 inf-proxmox7 pve-ha-lrm[5528]: lost lock
'ha_agent_inf-proxmox7_lock - cfs lock update failed - Device or
resource busy
May 6 18:35:40 inf-proxmox7 pve-ha-crm[5421]: status change slave =>
wait_for_quorum
May 6 18:35:41 inf-proxmox7 corosync[2648]: [TOTEM ] A new membership
(1.e60) was formed. Members joined: 1 3 4 5
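(Given the "lost lock 'ha_agent_inf-proxmox7_lock'" line above, I suppose I
should also capture the HA manager's own view the next time this happens,
for example:)

  ha-manager status                        # CRM master, per-node LRM state, resource states
  systemctl status pve-ha-crm pve-ha-lrm   # state of the HA services themselves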
Then:
May 6 18:35:41 inf-proxmox7 pmxcfs[2602]: [status] notice: node has quorum
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice:
cpg_send_message retried 1 times
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received
sync request (epoch 1/2592/00000031)
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received
sync request (epoch 1/2592/00000032)
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message
failed: 9
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message
failed: 9
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received all
states
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: all data is
up to date
May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice:
dfsm_deliver_queue: queue length 144
Then:
May 6 18:35:57 inf-proxmox7 corosync[2648]: [TOTEM ] A new membership
(1.e64) was formed. Members left: 3 4
May 6 18:35:57 inf-proxmox7 corosync[2648]: [TOTEM ] Failed to
receive the leave message. failed: 3 4
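(Members 3 and 4 dropping out like this makes me suspect the corosync network
itself, so next time I will also look for link or NIC events in the kernel
log around that window; the grep pattern below is only an example and would
need to be adjusted to our interface names:)

  journalctl -k --since "2020-05-06 18:30" --until "2020-05-06 18:40" | grep -iE 'link|bond|nic'
  ip -s link   # per-interface error and drop counters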
And finally the crash, after these last log lines:
May 6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: status change
wait_for_quorum => slave
May 6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Main process
exited, code=exited, status=17/n/a
May 6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Failed with
result 'exit-code'.
May 6 18:36:36 inf-proxmox7 systemd[1]: Failed to start Proxmox VE
replication runner.
May 6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: loop take too long (51
seconds)
May 6 18:36:36 inf-proxmox7 systemd[1]: watchdog-mux.service: Succeeded.
May 6 18:36:36 inf-proxmox7 kernel: [1292969.953131] watchdog:
watchdog0: watchdog did not stop!
May 6 18:36:36 inf-proxmox7 pvestatd[2894]: status update time (5.201
seconds)
^@^@^@^@^@^@
followed by a binary part...
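(The very last lines, with watchdog-mux stopping and "watchdog did not
stop!", make me think the "crash" was in fact the watchdog resetting the node
once the HA services lost their lock, rather than a kernel panic. To check
the watchdog side after the next occurrence, I suppose something like this
would be enough; softdog is only my assumption for a default setup:)

  systemctl status watchdog-mux       # PVE's watchdog multiplexer
  lsmod | grep -i dog                 # which watchdog module is loaded (softdog by default, I believe)
  journalctl -u watchdog-mux -u pve-ha-lrm --since "2020-05-06 18:30"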
Thank you again,
Hervé
On 11/05/2020 10:39, Eneko Lacunza wrote:
>>> Hi Hervé,
>>>
>>> This seems to be a network issue. What is the network setup in this
>>> cluster? What do the syslog entries about corosync and pve-cluster show?
>>>
>>> Don't enable HA until you have a stable cluster quorum.
>>>
>>> Cheers
>>> Eneko
>>>
>>>> On 11/5/20 at 10:35, Herve Ballans wrote:
>>>> Hi everybody,
>>>>
>>>> I would like to take the opportunity, at the start of this new week,
>>>> to raise my issue again.
>>>>
>>>> Does anyone have any idea why such a problem occurred, or is this
>>>> problem really something new?
>>>>
>>>> Thanks again,
>>>> Hervé
>>>>
>>>> On 07/05/2020 18:28, Herve Ballans wrote:
>>>>> Hi all,
>>>>>
>>>>> *Cluster info:*
>>>>>
>>>>> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>>>>> * Ceph RBD storage (Nautilus)
>>>>> * In production for many years with no major issues
>>>>> * No specific network problems at the time the problem occurred
>>>>> * Nodes have the same date/time (configured with the same NTP server)
>>>>>
>>>>> *Symptoms:*
>>>>>
>>>>> Suddenly, last night (around 7 PM), all the nodes of our cluster seem
>>>>> to have rebooted at the same time for no apparent reason (I mean, we
>>>>> weren't doing anything on them)!
>>>>> During the reboot, the services "Corosync Cluster Engine" and "Proxmox
>>>>> VE replication runner" failed. After each node rebooted, we had to
>>>>> start those services manually.
>>>>>
>>>>> Once rebooted with all PVE services running, some nodes were in HA LRM
>>>>> status "old timestamp - dead?" while others were in "active" or
>>>>> "wait_for_agent_lock" status...
>>>>> The nodes switch states regularly... and it loops back and forth as
>>>>> long as we don't change the configuration...
>>>>>
>>>>> At the same time, the pve-ha-crm service reported unexpected errors,
>>>>> for example: "Configuration file
>>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even
>>>>> though the file does exist, but on another node!
>>>>> Such a message is probably a consequence of the fencing between
>>>>> nodes due to the changes of status...
>>>>>
>>>>> *What we have tried so far to stabilize the situation:*
>>>>>
>>>>> After several investigations and several operations that failed to
>>>>> solve anything (in particular a complete upgrade to the latest PVE
>>>>> version, 6.1-11),
>>>>>
>>>>> we finally removed the HA configuration from all the VMs.
>>>>> Since then, the situation seems to have stabilized although,
>>>>> obviously, it is not nominal!
>>>>>
>>>>> Now, all the nodes are in HA LRM status "idle"; they sometimes switch
>>>>> to the "old timestamp - dead?" state, then come back to "idle".
>>>>> None of them are in the "active" state.
>>>>> Obviously, the quorum status is "no quorum".
>>>>>
>>>>> Note that as soon as we try to re-enable HA on the VMs, the problem
>>>>> occurs again (the nodes reboot!) :(
>>>>>
>>>>> *Question:*
>>>>>
>>>>> Have you ever experienced such a problem, or do you know a way to
>>>>> restore a correct HA configuration in this case?
>>>>> Note that the nodes are currently on PVE version 6.1-11.
>>>>>
>>>>> I can post specific logs if that would be useful.
>>>>>
>>>>> Thanks in advance for your help,
>>>>> Hervé