[PVE-User] critical HA problem on a PVE6 cluster
Mark Adams
mark at openvs.co.uk
Mon May 11 19:33:31 CEST 2020
As Eneko already said, this really sounds like a network problem - if your
hosts lose corosync connectivity to each other, the HA watchdog will fence
(reboot) them, and it sounds like that is what happened to you.
Are you sure there have been no changes to your network around the time
this happened? Have you checked that your switch config is still right
(maybe it reset?)
Maybe the switches have bugged out and need a reboot? Check their logs
for errors.
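As a rough first check on the node side (a sketch only - I'm assuming the
standard Debian/PVE /var/log/syslog location, and the grep patterns are taken
from the log excerpts in this thread; the helper name is made up):

```shell
# corosync_flaps: summarise corosync network-flap indicators in a syslog
# file, to gauge how often the cluster was losing the totem token before
# the nodes fenced themselves.
corosync_flaps() {
    log="$1"
    # Token timeouts: corosync lost contact with a peer for too long
    printf 'token timeouts:     %s\n' "$(grep -c 'Token has not been received' "$log")"
    # Membership reforms: nodes joined or left the totem ring
    printf 'membership reforms: %s\n' "$(grep -c 'A new membership' "$log")"
    # pmxcfs retries: the cluster filesystem struggling to send messages
    printf 'pmxcfs retries:     %s\n' "$(grep -c 'cpg_send_message retr' "$log")"
}

# Example: corosync_flaps /var/log/syslog
```

High counts over a short window on several nodes at once almost always point
at the corosync network rather than at PVE itself; on a live node,
corosync-cfgtool -s will also show the ring/link status corosync sees.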
On Mon, 11 May 2020 at 18:13, Herve Ballans <herve.ballans at ias.u-psud.fr>
wrote:
> Hi again, (sorry for the spam!).
>
> I just found logs from just before the crash of one of the nodes (time
> of crash: 18:36:36). They could be more useful than the logs sent
> previously... (I have removed the normal events here.)
>
> First, several messages like this (the first one at 11:00 am):
>
> May 6 18:33:25 inf-proxmox7 corosync[2648]: [TOTEM ] Token has not
> been received in 2212 ms
> May 6 18:33:26 inf-proxmox7 corosync[2648]: [TOTEM ] A processor
> failed, forming new configuration.
>
> Then:
>
> May 6 18:34:14 inf-proxmox7 corosync[2648]: [MAIN ] Completed
> service synchronization, ready to provide service.
> May 6 18:34:14 inf-proxmox7 pvesr[3342642]: error with cfs lock
> 'file-replication_cfg': got lock request timeout
> May 6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Main process
> exited, code=exited, status=17/n/a
> May 6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Failed with
> result 'exit-code'.
> May 6 18:34:14 inf-proxmox7 systemd[1]: Failed to start Proxmox VE
> replication runner.
> May 6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice:
> cpg_send_message retry 30
> May 6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice:
> cpg_send_message retried 30 times
>
> Then again a series of "processor failed" messages (147 in total before
> the crash):
>
> May 6 18:35:03 inf-proxmox7 corosync[2648]: [TOTEM ] Token has not
> been received in 2212 ms
> May 6 18:35:04 inf-proxmox7 corosync[2648]: [TOTEM ] A processor
> failed, forming new configuration.
>
> Then:
>
> May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] notice: start cluster
> connection
> May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_join failed: 14
> May 6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: can't initialize
> service
> May 6 18:35:40 inf-proxmox7 pve-ha-lrm[5528]: lost lock
> 'ha_agent_inf-proxmox7_lock - cfs lock update failed - Device or
> resource busy
> May 6 18:35:40 inf-proxmox7 pve-ha-crm[5421]: status change slave =>
> wait_for_quorum
> May 6 18:35:41 inf-proxmox7 corosync[2648]: [TOTEM ] A new membership
> (1.e60) was formed. Members joined: 1 3 4 5
>
> Then:
>
> May 6 18:35:41 inf-proxmox7 pmxcfs[2602]: [status] notice: node has quorum
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice:
> cpg_send_message retried 1 times
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received
> sync request (epoch 1/2592/00000031)
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received
> sync request (epoch 1/2592/00000032)
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message
> failed: 9
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message
> failed: 9
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received all
> states
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: all data is
> up to date
> May 6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice:
> dfsm_deliver_queue: queue length 144
>
> Then:
>
> May 6 18:35:57 inf-proxmox7 corosync[2648]: [TOTEM ] A new membership
> (1.e64) was formed. Members left: 3 4
> May 6 18:35:57 inf-proxmox7 corosync[2648]: [TOTEM ] Failed to
> receive the leave message. failed: 3 4
>
> And finally the crash, after these last log entries:
>
> May 6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: status change
> wait_for_quorum => slave
> May 6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Main process
> exited, code=exited, status=17/n/a
> May 6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Failed with
> result 'exit-code'.
> May 6 18:36:36 inf-proxmox7 systemd[1]: Failed to start Proxmox VE
> replication runner.
> May 6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: loop take too long (51
> seconds)
> May 6 18:36:36 inf-proxmox7 systemd[1]: watchdog-mux.service: Succeeded.
> May 6 18:36:36 inf-proxmox7 kernel: [1292969.953131] watchdog:
> watchdog0: watchdog did not stop!
> May 6 18:36:36 inf-proxmox7 pvestatd[2894]: status update time (5.201
> seconds)
> ^@^@^@^@^@^@
>
> followed by a binary part...
>
> Thank you again,
> Hervé
>
> On 11/05/2020 10:39, Eneko Lacunza wrote:
> >>> Hi Hervé,
> >>>
> >>> This seems like a network issue. What is the network setup in this
> >>> cluster? What does syslog show about corosync and pve-cluster?
> >>>
> >>> Don't enable HA until you have a stable cluster quorum.
> >>>
> >>> Cheers
> >>> Eneko
> >>>
> >>> El 11/5/20 a las 10:35, Herve Ballans escribió:
> >>>> Hi everybody,
> >>>>
> >>>> I would like to take the opportunity at the beginning of this new
> >>>> week to ask about my issue again.
> >>>>
> >>>> Does anyone have any idea why such a problem occurred, or is this
> >>>> problem really something new?
> >>>>
> >>>> Thanks again,
> >>>> Hervé
> >>>>
> >>>> On 07/05/2020 18:28, Herve Ballans wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> *Cluster info:*
> >>>>>
> >>>>> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
> >>>>> * Ceph rbd storage (Nautilus)
> >>>>> * In production for many years with no major issues
> >>>>> * No specific network problems at the time the problem occurred
> >>>>> * Nodes have the same date/time (configured with the same NTP server)
> >>>>>
> >>>>> *Symptoms:*
> >>>>>
> >>>>> Suddenly, last night (around 7 PM), all nodes of our cluster seem
> >>>>> to have rebooted at the same time for no apparent reason (I mean,
> >>>>> we weren't doing anything on it)!
> >>>>> During the reboot, the services "Corosync Cluster Engine" and
> >>>>> "Proxmox VE replication runner" failed. After each node rebooted,
> >>>>> we had to start those services manually.
> >>>>>
> >>>>> Once rebooted with all PVE services running, some nodes were in HA
> >>>>> LRM status "old timestamp - dead?" while others were in "active"
> >>>>> or "wait_for_agent_lock" status...
> >>>>> Nodes switched states regularly, and it looped back and forth as
> >>>>> long as we didn't change the configuration...
> >>>>>
> >>>>> At the same time, the pve-ha-crm service got unexpected errors,
> >>>>> for example: "Configuration file
> >>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even
> >>>>> though the file exists, but on another node!
> >>>>> Such a message is probably a consequence of the fencing between
> >>>>> nodes due to the changes of status...
> >>>>>
> >>>>> *What we have tried until now to stabilize the situation:*
> >>>>>
> >>>>> After several investigations and several operations that failed to
> >>>>> solve anything (in particular a complete upgrade to the latest PVE
> >>>>> version, 6.1-11), we finally removed the HA configuration from all
> >>>>> the VMs.
> >>>>> Since then, the state seems to have stabilized although, obviously,
> >>>>> it is not nominal!
> >>>>>
> >>>>> Now, all the nodes are in HA LRM status "idle" and sometimes
> >>>>> switch to the "old timestamp - dead?" state, then come back to the
> >>>>> "idle" state. None of them is in the "active" state.
> >>>>> Obviously, the quorum status is "no quorum".
> >>>>>
> >>>>> Note that as soon as we try to re-activate HA on the VMs, the
> >>>>> problem occurs again (nodes reboot!) :(
> >>>>>
> >>>>> *Question:*
> >>>>>
> >>>>> Have you ever experienced such a problem, or do you know a way to
> >>>>> restore a correct HA configuration in this case?
> >>>>> Note that the nodes are currently on PVE version 6.1-11.
> >>>>>
> >>>>> I can provide specific logs if useful.
> >>>>>
> >>>>> Thanks in advance for your help,
> >>>>> Hervé
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user