[PVE-User] PVE 6.2 Strange cluster node fence

Eneko Lacunza elacunza at binovo.es
Wed Apr 14 18:26:08 CEST 2021


Hi,

So I have figured out what likely happened.

Indeed it was very likely network congestion, because proxmox1 and 
proxmox2 were using one switch and proxmox3 the other, due to proxmox1 
and proxmox2 not having properly loaded the bond-primary directive 
(the primary slave was not shown in /proc/net/bonding/bond0 although it 
was present in /etc/network/interfaces).
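For anyone hitting the same issue, the bond configuration the kernel 
actually loaded can be checked with something like this (a minimal 
sketch; adjust the bond name to your setup):

# Show the configured primary and the currently active slave
grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/bond0

If the "Primary Slave" line doesn't match what /etc/network/interfaces 
says, the directive wasn't applied.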

Additionally, I just checked that both switches are linked by a 1G 
port, because the 4th SFP+ port is being used for the backup server... 
(against my recommendation during the cluster setup, I must add...)

So very likely it was network congestion that kicked proxmox1 out of the 
cluster.
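As an extra safety margin, Michael's suggestion of raising the corosync 
token timeout goes in the totem section of /etc/pve/corosync.conf (a 
sketch; remember to bump config_version so the change propagates to all 
nodes):

totem {
     # existing cluster_name/interface/version settings stay as they are
     token: 5000
}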

It seems that the bond directives should be present on the slaves too, like:

auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
     bond-master bond0
     bond-primary ens2f0np1
# Switch2

iface ens2f1np1 inet manual
     bond-master bond0
     bond-primary ens2f0np1
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
     bond-slaves ens2f0np0 ens2f1np1
     bond-miimon 100
     bond-mode active-backup
     bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
     address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
     address 192.168.90.11
     gateway 192.168.90.1
     bridge-ports bond0
     bridge-stp off
     bridge-fd 0

Otherwise, it seems the primary sometimes doesn't get configured properly...
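In that case the primary can also be set at runtime through sysfs, 
without restarting networking (a sketch, assuming the bond and primary 
interface names from the config above; the interface must already be 
enslaved to bond0):

# Set the primary slave of bond0 on the fly
echo ens2f0np1 > /sys/class/net/bond0/bonding/primary
# Verify that the kernel picked it up
grep 'Primary Slave' /proc/net/bonding/bond0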

Thanks again Michael and Stefan!
Eneko


On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
> Hi Michael,
>
> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>> On Wed, 14 Apr 2021 11:04:10 +0200
>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>
>>> Hi all,
>>>
>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>
>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>> operating normally for a year. Last update was on January 21st 2021.
>>> Storage is Ceph and nodes are connected to the same network switch
>>> with active-passive bonds.
>>>
>>> proxmox1 was fenced and automatically rebooted, then everything
>>> recovered. HA restarted VMs in other nodes too.
>>>
>>> proxmox1 syslog: (no network link issues reported at device level)
>> I have seen this occasionally, and every time the cause was high network
>> load/congestion that caused a token timeout. The default token timeout
>> in corosync is IMHO very optimistically set to 1000 ms, so I changed
>> this setting to 5000 ms, and since doing so I have never seen fencing
>> caused by network load/congestion again. You could try this and see if
>> it helps you.
>>
>> PS. my cluster communication is on a dedicated gb bonded vlan.
> Thanks for the info. In this case the network is 10Gbit (I see I didn't 
> include this info), but only for the proxmox nodes:
>
> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
> - Both switches are interconnected with an SFP+ DAC
> - The active-passive bonds in each proxmox node go to one SFP+ interface on 
> each switch. Primary interfaces are configured to be on the same switch.
> - Connectivity to the LAN is done with a 1 Gbit link
> - The Proxmox 2x10G bond is used for VM networking and the Ceph 
> public/private networks.
>
> I wouldn't expect high network load/congestion because it's on an 
> internal LAN, with 1Gbit clients. No Ceph issues/backfilling were 
> occurring during the fence.
>
> Network cards are Broadcom.
>
> Thanks 

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/



