[PVE-User] PVE 6.2 Strange cluster node fence

Stefan M. Radman smr at kmi.com
Wed Apr 14 13:22:33 CEST 2021


Hi Eneko

Do you have separate physical interfaces for the cluster (corosync) traffic?
Do you have them on separate VLANs on your switches?
Are you running 1 or 2 corosync rings?

Please post your /etc/network/interfaces and explain which interface connects where.

Thanks

Stefan


On Apr 14, 2021, at 12:12, Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:


From: Eneko Lacunza <elacunza at binovo.es>
Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
Date: April 14, 2021 at 12:12:09 GMT+2
To: pve-user at lists.proxmox.com


Hi Michael,

On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
On Wed, 14 Apr 2021 11:04:10 +0200
Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:

Hi all,

Yesterday we had a strange fence happen in a PVE 6.2 cluster.

Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
operating normally for a year. Last update was on January 21st 2021.
Storage is Ceph and nodes are connected to the same network switch
with active-passive bonds.

proxmox1 was fenced and automatically rebooted, then everything
recovered. HA restarted VMs in other nodes too.

proxmox1 syslog: (no network link issues reported at device level)

I have seen this occasionally and every time the cause was high network
load/network congestion which caused token timeout. The default token
timeout in corosync IMHO is very optimistically configured to 1000 ms
so I have changed this setting to 5000 ms and after I have done this I
have never seen fencing happening caused by network load/network
congestion again. You could try this and see if that helps you.
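
For reference, a minimal sketch of the arithmetic behind that setting. It assumes the rule described in the corosync.conf man page: when a nodelist with at least 3 nodes is configured, the effective token timeout is token + (nodes - 2) * token_coefficient, with token_coefficient defaulting to 650 ms. The function name is made up for illustration and is not part of corosync or PVE.

```python
def effective_token_timeout_ms(token_ms, nodes, token_coefficient_ms=650):
    """Effective totem token timeout in ms, following the corosync.conf(5) rule.

    Assumption: a nodelist is configured. The coefficient only applies
    when the nodelist contains at least 3 nodes.
    """
    if nodes >= 3:
        return token_ms + (nodes - 2) * token_coefficient_ms
    return token_ms


if __name__ == "__main__":
    # 3-node cluster, comparing the 1000 ms and 5000 ms settings from this thread
    for token in (1000, 5000):
        print(token, "->", effective_token_timeout_ms(token, nodes=3), "ms")
```

On PVE the value itself would normally be set as `token: 5000` in the totem section of /etc/pve/corosync.conf, with config_version incremented in the same edit. Treat that as a pointer and check the Proxmox documentation before changing a live cluster.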

PS. My cluster communication is on a dedicated Gbit bonded VLAN.

Thanks for the info. In this case the network is 10 Gbit (I see I didn't include this info), but only for the Proxmox nodes:

- We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
- Both switches are interconnected with an SFP+ DAC
- The active-passive bond in each Proxmox node has one SFP+ interface on each switch. Primary interfaces are configured to be on the same switch.
- Connectivity to the LAN is done with a 1 Gbit link
- The Proxmox 2x10G bond is used for VM networking and the Ceph public/private networks.

I wouldn't expect high network load/congestion because it's on an internal LAN, with 1 Gbit clients. No Ceph issues/backfilling were occurring during the fence.

Network cards are Broadcom.

Thanks

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/



_______________________________________________
pve-user mailing list
pve-user at lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user




