[PVE-User] PVE 6.2 Strange cluster node fence
Eneko Lacunza
elacunza at binovo.es
Wed Apr 14 18:26:08 CEST 2021
Hi,
So I have figured out what likely happened.
Indeed, it was very likely network congestion: proxmox1 and
proxmox2 were using one switch and proxmox3 the other, because proxmox1
and proxmox2 had not properly loaded the bond-primary directive
(the primary slave was not shown in /proc/net/bonding/bond0 although it was
present in /etc/network/interfaces).
Additionally, I just found out that both switches are linked by a 1G
port, because the 4th SFP+ port is being used for the backup server...
(against my recommendation during the cluster setup, I must add...)
So very likely it was network congestion that kicked proxmox1 out of the
cluster.
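For reference, the mismatch shows up with something like this on each node
(bond0 is the bond name used here; on proxmox1 and proxmox2 the "Primary
Slave" line was simply missing):

# What the bonding driver actually loaded, per node
grep -E 'Bonding Mode|Primary Slave|Currently Active Slave' /proc/net/bonding/bond0
# Without a "Primary Slave" line, active-backup has no preferred interface,
# so each node just stays on whichever slave (and switch) it ended up on.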
It seems that the bond directives should be present in the slave stanzas too, like:
auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
    bond-master bond0
    bond-primary ens2f0np1
# Switch2

iface ens2f1np1 inet manual
    bond-master bond0
    bond-primary ens2f0np1
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves ens2f0np0 ens2f1np1
    bond-miimon 100
    bond-mode active-backup
    bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
    address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
    address 192.168.90.11
    gateway 192.168.90.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
Otherwise, it seems the primary sometimes doesn't get configured properly...
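After fixing /etc/network/interfaces, something along these lines should pick
up the change and confirm it (ifreload needs ifupdown2; otherwise a node
reboot is the safe way):

# Re-apply the interfaces file (ifupdown2 only), then verify the primary
ifreload -a
grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/bond0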
Thanks again Michael and Stefan!
Eneko
On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
> Hi Michael,
>
> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>> On Wed, 14 Apr 2021 11:04:10 +0200
>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>
>>> Hi all,
>>>
>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>
>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>> operating normally for a year. Last update was on January 21st 2021.
>>> Storage is Ceph and nodes are connected to the same network switch
>>> with active-passive bonds.
>>>
>>> proxmox1 was fenced and automatically rebooted, then everything
>>> recovered. HA restarted VMs on other nodes too.
>>>
>>> proxmox1 syslog: (no network link issues reported at device level)
>> I have seen this occasionally, and every time the cause was high network
>> load/network congestion which caused a token timeout. The default token
>> timeout in corosync is IMHO very optimistically configured at 1000 ms,
>> so I changed this setting to 5000 ms, and since doing so I have
>> never seen fencing caused by network load/network
>> congestion again. You could try this and see if that helps you.
>>
>> PS. my cluster communication is on a dedicated gb bonded vlan.
> Thanks for the info. In this case the network is 10Gbit (I see I didn't
> include this info), but only for the proxmox nodes:
>
> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
> - Both switches are interconnected with a SFP+ DAC
> - Active-passive bonds in each proxmox node use one SFP+ interface on
> each switch. Primary interfaces are configured to be on the same switch.
> - Connectivity to the LAN is done with a 1 Gbit link
> - Proxmox 2x10G Bond is used for VM networking and Ceph public/private
> networks.
>
> I wouldn't expect high network load/congestion because it's an
> internal LAN with 1Gbit clients. No Ceph issues/backfilling were
> occurring during the fence.
>
> Network cards are Broadcom.
>
> Thanks
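For the record, if we end up raising the corosync token timeout as Michael
suggests above, my understanding is it would be roughly this in the totem
section of /etc/pve/corosync.conf (not tested here; config_version must be
increased so the change propagates to all nodes):

totem {
    # existing settings (cluster_name, interface, version, ...) stay as they are
    config_version: <current value + 1>
    token: 5000
}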
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/