[PVE-User] PVE 6.2 Strange cluster node fence
elacunza at binovo.es
Wed Apr 14 15:18:13 CEST 2021
El 14/4/21 a las 13:22, Stefan M. Radman escribió:
> Hi Eneko
> Do you have separate physical interfaces for the cluster (corosync)
> Do you have them on separate VLANs on your switches?
Only Ceph traffic is on VLAN91; the rest is untagged.
> Are you running 1 or 2 corosync rings?
This is standard... no hand tuning:
> Please post your /etc/network/interfaces and explain which interface
> connects where.
iface lo inet loopback

iface ens2f0np0 inet manual
iface ens2f1np1 inet manual
iface eno1 inet manual
iface eno2 inet manual

iface bond0 inet manual
        bond-slaves ens2f0np0 ens2f1np1

iface bond0.91 inet static

iface vmbr0 inet static
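So it's a single corosync ring on the shared network. A second ring would mean a second link per node in /etc/pve/corosync.conf; a rough sketch of what that would look like, with purely hypothetical addresses for a dedicated corosync VLAN:

```
nodelist {
  node {
    name: proxmox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11   # current shared network (hypothetical address)
    ring1_addr: 10.91.0.11     # dedicated corosync VLAN (hypothetical)
  }
  # proxmox2 / proxmox3 would get analogous ring0_addr/ring1_addr entries
}
```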
>> On Apr 14, 2021, at 12:12, Eneko Lacunza via pve-user
>> <pve-user at lists.proxmox.com <mailto:pve-user at lists.proxmox.com>> wrote:
>> *From: *Eneko Lacunza <elacunza at binovo.es <mailto:elacunza at binovo.es>>
>> *Subject: **Re: [PVE-User] PVE 6.2 Strange cluster node fence*
>> *Date: *April 14, 2021 at 12:12:09 GMT+2
>> *To: *pve-user at lists.proxmox.com <mailto:pve-user at lists.proxmox.com>
>> Hi Michael,
>> El 14/4/21 a las 11:21, Michael Rasmussen via pve-user escribió:
>>> On Wed, 14 Apr 2021 11:04:10 +0200
>>> Eneko Lacunza via pve-user<pve-user at lists.proxmox.com
>>> <mailto:pve-user at lists.proxmox.com>> wrote:
>>>> Hi all,
>>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>>> operating normally for a year. Last update was on January 21st 2021.
>>>> Storage is Ceph and the nodes are connected to the same network
>>>> switch with active-passive bonds.
>>>> proxmox1 was fenced and automatically rebooted, then everything
>>>> recovered. HA restarted VMs in other nodes too.
>>>> proxmox1 syslog: (no network link issues reported at device level)
>>> I have seen this occasionally, and every time the cause was high
>>> network load/congestion triggering a token timeout. The default token
>>> timeout in corosync is, IMHO, very optimistically set to 1000 ms, so I
>>> changed it to 5000 ms; since then I have never seen fencing caused by
>>> network load/congestion. You could try this and see if it helps.
>>> PS. my cluster communication is on a dedicated gb bonded vlan.
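For reference, that change would go in the totem section of /etc/pve/corosync.conf (token is in milliseconds; config_version has to be incremented so the change propagates to the other nodes, and the value below is just an example):

```
totem {
  config_version: 5   # increment on every edit (example value)
  token: 5000         # raise token timeout from the 1000 ms default
  # ...rest of the totem section unchanged...
}
```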
>> Thanks for the info. In this case network is 10Gbit (I see I didn't
>> include this info) but only for proxmox nodes:
>> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
>> - Both switches are interconnected with a SFP+ DAC
>> - Active-passive Bonds in each proxmox node go one SFP+ interface on
>> each switch. Primary interfaces are configured to be on the same switch.
>> - Connectivity to the LAN is done with 1 Gbit link
>> - Proxmox 2x10G Bond is used for VM networking and Ceph
>> public/private networks.
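For completeness, pinning the active leg of a bond to one switch is done with the bond-primary option in /etc/network/interfaces; a sketch using the interface names from above (option values are the usual ifenslave defaults, not copied from the actual config):

```
auto bond0
iface bond0 inet manual
        bond-slaves ens2f0np0 ens2f1np1
        bond-mode active-backup
        bond-miimon 100
        bond-primary ens2f0np0   # keeps the active leg on the same switch
```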
>> I wouldn't expect high network load/congestion because it's an
>> internal LAN with 1 Gbit clients. No Ceph issues/backfilling were
>> occurring during the fence.
>> Network cards are Broadcom.
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>> Tel. +34 943 569 206 | https://www.binovo.es <https://www.binovo.es>
>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>> pve-user mailing list
>> pve-user at lists.proxmox.com