[PVE-User] critical HA problem on a PVE6 cluster
Herve Ballans
herve.ballans at ias.u-psud.fr
Mon May 11 18:24:32 CEST 2020
Sorry, my previous mail was sent too quickly; it was missing some required logs.
Regarding syslog, here are some extracts for corosync (the following log
block repeats in a loop, but with different numbers):
May 6 18:38:02 inf-proxmox6 corosync[2674]: [KNET ] link: host: 4
link: 0 is down
May 6 18:38:02 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:02 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
has no active links
May 6 18:38:05 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 4
link: 0 is up
May 6 18:38:05 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:10 inf-proxmox6 corosync[2674]: [KNET ] link: host: 3
link: 0 is down
May 6 18:38:10 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:10 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
has no active links
May 6 18:38:12 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 3
link: 0 is up
May 6 18:38:12 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:12 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List: 64
May 6 18:38:18 inf-proxmox6 corosync[2674]: [KNET ] link: host: 4
link: 0 is down
May 6 18:38:18 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:18 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
has no active links
May 6 18:38:19 inf-proxmox6 corosync[2674]: [KNET ] link: host: 3
link: 0 is down
May 6 18:38:19 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:19 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
has no active links
May 6 18:38:20 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 4
link: 0 is up
May 6 18:38:20 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:21 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 3
link: 0 is up
May 6 18:38:21 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:29 inf-proxmox6 corosync[2674]: [KNET ] link: host: 3
link: 0 is down
May 6 18:38:29 inf-proxmox6 corosync[2674]: [KNET ] link: host: 4
link: 0 is down
May 6 18:38:29 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:29 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
has no active links
May 6 18:38:29 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:29 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
has no active links
May 6 18:38:31 inf-proxmox6 corosync[2674]: [TOTEM ] Token has not
been received in 107 ms
May 6 18:38:31 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 3
link: 0 is up
May 6 18:38:31 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 4
link: 0 is up
May 6 18:38:31 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:31 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List: fd
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
100
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
101
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
102
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
103
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
104
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
106
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
107
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
108
May 6 18:38:42 inf-proxmox6 corosync[2674]: [TOTEM ] Retransmit List:
109
May 6 18:38:44 inf-proxmox6 corosync[2674]: [KNET ] link: host: 3
link: 0 is down
May 6 18:38:44 inf-proxmox6 corosync[2674]: [KNET ] link: host: 4
link: 0 is down
May 6 18:38:44 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:44 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
has no active links
May 6 18:38:44 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:44 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
has no active links
May 6 18:38:46 inf-proxmox6 corosync[2674]: [TOTEM ] Token has not
been received in 106 ms
May 6 18:38:46 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 4
link: 0 is up
May 6 18:38:46 inf-proxmox6 corosync[2674]: [KNET ] host: host: 4
(passive) best link: 0 (pri: 1)
May 6 18:38:47 inf-proxmox6 corosync[2674]: [KNET ] rx: host: 3
link: 0 is up
May 6 18:38:47 inf-proxmox6 corosync[2674]: [KNET ] host: host: 3
(passive) best link: 0 (pri: 1)
May 6 18:38:51 inf-proxmox6 corosync[2674]: [TOTEM ] Token has not
been received in 4511 ms
May 6 18:38:52 inf-proxmox6 corosync[2674]: [TOTEM ] A new membership
(1.ea8) was formed. Members
May 6 18:38:52 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:38:52 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:38:52 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:38:52 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:38:52 inf-proxmox6 corosync[2674]: [QUORUM] Members[4]: 1 3 4 5
May 6 18:38:52 inf-proxmox6 corosync[2674]: [MAIN ] Completed
service synchronization, ready to provide service.
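
For reference, the knet link state reported above can also be cross-checked
live on each node with standard corosync 3 / PVE commands (nothing below is
specific to this cluster):

  # state of each knet link as seen from this node
  corosync-cfgtool -s
  # cluster membership and quorum status
  pvecm status
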
Nothing really relevant regarding pve-cluster in the logs, as it is marked
as succeeded?... For instance here:
May 6 22:17:33 inf-proxmox6 systemd[1]: Stopping The Proxmox VE cluster
filesystem...
May 6 22:17:33 inf-proxmox6 pmxcfs[2561]: [main] notice: teardown
filesystem
May 6 22:17:33 inf-proxmox6 pvestatd[2906]: status update time (19.854
seconds)
May 6 22:17:34 inf-proxmox6 systemd[7888]: etc-pve.mount: Succeeded.
May 6 22:17:34 inf-proxmox6 systemd[1]: etc-pve.mount: Succeeded.
May 6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed -
Operation not supported
May 6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed -
Operation not supported
May 6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed -
Operation not supported
May 6 22:17:34 inf-proxmox6 pvestatd[2906]: rados_connect failed -
Operation not supported
May 6 22:17:35 inf-proxmox6 pmxcfs[2561]: [main] notice: exit proxmox
configuration filesystem (0)
May 6 22:17:35 inf-proxmox6 systemd[1]: pve-cluster.service: Succeeded.
May 6 22:17:35 inf-proxmox6 systemd[1]: Stopped The Proxmox VE cluster
filesystem.
May 6 22:17:35 inf-proxmox6 systemd[1]: Starting The Proxmox VE cluster
filesystem...
May 6 22:17:35 inf-proxmox6 pmxcfs[8260]: [status] notice: update
cluster info (cluster name cluster-proxmox, version = 6)
May 6 22:17:35 inf-proxmox6 corosync[8007]: [TOTEM ] A new membership
(1.1998) was formed. Members joined: 2 3 4 5
May 6 22:17:36 inf-proxmox6 systemd[1]: Started The Proxmox VE cluster
filesystem.
Here is another extract that also shows some slow ops on a Ceph OSD:
May 6 18:38:59 inf-proxmox6 corosync[2674]: [TOTEM ] Token has not
been received in 3810 ms
May 6 18:39:00 inf-proxmox6 systemd[1]: Starting Proxmox VE replication
runner...
May 6 18:39:01 inf-proxmox6 ceph-mon[1119484]: 2020-05-06 18:39:01.493
7feaed4bb700 -1 mon.0 at 0(leader) e6 get_health_metrics reporting 46 slow
ops, oldest is osd_failure(failed timeout osd.5
[v2:192.168.217.8:6884/1879695,v1:192.168.217.8:6885/1879695] for 20sec
e73191 v73191)
May 6 18:39:02 inf-proxmox6 corosync[2674]: [TOTEM ] A new membership
(1.eb4) was formed. Members
May 6 18:39:02 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:39:02 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:39:02 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:39:02 inf-proxmox6 corosync[2674]: [CPG ] downlist
left_list: 0 received
May 6 18:39:02 inf-proxmox6 corosync[2674]: [QUORUM] Members[4]: 1 3 4 5
May 6 18:39:02 inf-proxmox6 corosync[2674]: [MAIN ] Completed
service synchronization, ready to provide service.
May 6 18:39:02 inf-proxmox6 pvesr[1409653]: trying to acquire cfs lock
'file-replication_cfg' ...
May 6 18:39:03 inf-proxmox6 pvesr[1409653]: trying to acquire cfs lock
'file-replication_cfg' ...
May 6 18:39:06 inf-proxmox6 systemd[1]: pvesr.service: Succeeded.
May 6 18:39:06 inf-proxmox6 systemd[1]: Started Proxmox VE replication
runner.
May 6 18:39:06 inf-proxmox6 ceph-mon[1119484]: 2020-05-06 18:39:06.493
7feaed4bb700 -1 mon.0 at 0(leader) e6 get_health_metrics reporting 46 slow
ops, oldest is osd_failure(failed timeout osd.5
[v2:192.168.217.8:6884/1879695,v1:192.168.217.8:6885/1879695] for 20sec
e73191 v73191)
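
For completeness, the slow ops reported above can also be checked directly
on the Ceph side, e.g.:

  # overall cluster health, including slow op warnings
  ceph -s
  # more detail on the current health warnings
  ceph health detail
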
Just in case that makes sense to someone, thank you again,
Hervé
On 11/05/2020 17:58, Herve Ballans wrote:
>
> Hi Eneko,
>
> Thanks for your answer. At first I was also thinking of a network issue,
> but the physical network equipment doesn't seem to show any specific
> problems... Here are more details on the cluster (a corosync second-link
> sketch follows the interface list):
>
> 2x10Gb + 2x1Gb interfaces:
>
> * a 10Gb interface for the Ceph cluster network
> * a 10Gb interface for the main cluster network
> * the other two 1Gb interfaces are used for two other VLANs for the VMs
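>
> If corosync shares the main 10Gb link (link 0 in the logs above), one usual
> mitigation is to add a second, independent corosync link as a fallback. A
> minimal sketch of the per-node part of /etc/pve/corosync.conf, with example
> nodeid and addresses (the config_version in the totem section must also be
> bumped when editing this file):
>
>   node {
>     name: inf-proxmox6
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: 192.168.216.6   # existing link 0 (example address)
>     ring1_addr: 10.10.10.6      # added link 1 on a separate subnet (example)
>   }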
>
>
>
> On 11/05/2020 10:39, Eneko Lacunza wrote:
>> Hi Hervé,
>>
>> This seems like a network issue. What is the network setup in this
>> cluster? What do the syslog entries for corosync and pve-cluster show?
>>
>> Don't enable HA until you have a stable cluster quorum.
>>
>> Cheers
>> Eneko
>>
>> On 11/05/20 at 10:35, Herve Ballans wrote:
>>> Hi everybody,
>>>
>>> I would like to take the opportunity at the beginning of this new
>>> week to raise my issue again.
>>>
>>> Does anyone have any idea why such a problem occurred, or is this
>>> problem really something new?
>>>
>>> Thanks again,
>>> Hervé
>>>
>>> On 07/05/2020 18:28, Herve Ballans wrote:
>>>> Hi all,
>>>>
>>>> *Cluster info:*
>>>>
>>>> * 5 nodes (PVE version 6.1-3 at the time the problem occurred)
>>>> * Ceph RBD storage (Nautilus)
>>>> * In production for many years with no major issues
>>>> * No specific network problems at the time the problem occurred
>>>> * Nodes have the same date/time (configured with the same NTP server)
>>>>
>>>> *Symptoms:*
>>>>
>>>> Suddenly, last night (around 7 PM), all nodes of our cluster seem
>>>> to have rebooted at the same time for no apparent reason (I mean,
>>>> we weren't doing anything on it)!
>>>> During the reboot, the "Corosync Cluster Engine" and "Proxmox VE
>>>> replication runner" services failed. After the nodes rebooted, we
>>>> had to start those services manually.
>>>>
>>>> Once rebooted with all PVE services, some nodes were in HA lrm
>>>> status "old timestamp - dead?" while others were in "active" status
>>>> or in "wait_for_agent_lock" status...
>>>> Nodes switch states regularly, and it loops back and forth as long
>>>> as we don't change the configuration...
>>>>
>>>> At the same time, the pve-ha-crm service got unexpected errors, for
>>>> example: "Configuration file
>>>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even
>>>> though the file exists, but on another node!
>>>> Such a message is probably a consequence of the fencing between nodes
>>>> due to the change of status...
>>>>
>>>> *What we have tried until now to stabilize the situation:*
>>>>
>>>> After several investigations and several operations that failed to
>>>> solve anything (in particular a complete upgrade to the latest PVE
>>>> version 6.1-11),
>>>>
>>>> we finally removed the HA configuration from all the VMs.
>>>> Since then, the state seems to have stabilized although, obviously,
>>>> it is not nominal!
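>>>>
>>>> For reference, removing a guest from HA management can be done with
>>>> ha-manager (the resource ID below is just an example):
>>>>
>>>>   # list HA-managed resources and their current state
>>>>   ha-manager status
>>>>   # remove a guest from HA management (example: VM 501)
>>>>   ha-manager remove vm:501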
>>>>
>>>> Now, all the nodes are in HA lrm status "idle" and sometimes switch
>>>> to the "old timestamp - dead?" state, then come back to the "idle"
>>>> state. None of them is in the "active" state.
>>>> Obviously, the quorum status is "no quorum".
>>>>
>>>> It should be noted that, as soon as we try to re-activate the HA
>>>> status on the VMs, the problem occurs again (nodes reboot!) :(
>>>>
>>>> *Question:*
>>>>
>>>> Have you ever experienced such a problem, or do you know a way to
>>>> restore a correct HA configuration in this case?
>>>> Note that the nodes are currently on PVE version 6.1-11.
>>>>
>>>> I can post some specific logs if useful.
>>>>
>>>> Thanks in advance for your help,
>>>> Hervé
>>>>