[PVE-User] PVE 6.2 Strange cluster node fence
Eneko Lacunza
elacunza at binovo.es
Thu Apr 15 09:55:42 CEST 2021
Hi Stefan,
On 14/4/21 at 19:28, Stefan M. Radman wrote:
> The redundant corosync rings would definitely have prevented the
> fencing even in your scenario.
Yes that's for sure ;)
>
> As a final note you should also consider replacing that 1GbE link
> between the switches by an Nx1GbE bundle (LACP) for redundancy and
> bandwidth reasons or at least by 2 x 1GbE secured by spanning tree (RSTP).
I think we should interlink the switches with SFP+. Backups don't need
that much bandwidth, but the final say is not mine :(
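
For reference, a rough sketch of what Stefan's Nx1GbE LACP bundle could
look like on the Dell N1124T CLI (syntax written from memory and untested;
the port numbers are hypothetical, so check the switch manual first):

  configure
  interface range Gi1/0/23-24
  channel-group 1 mode active
  exit
  interface port-channel 1
  switchport mode trunk
  exit

The same port-channel configuration would be needed on both switches.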
Thanks a lot
Eneko
>
> Stefan
>
>> On Apr 14, 2021, at 18:26, Eneko Lacunza via pve-user
>> <pve-user at lists.proxmox.com> wrote:
>>
>>
>> From: Eneko Lacunza <elacunza at binovo.es>
>> Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
>> Date: April 14, 2021 at 18:26:08 GMT+2
>> To: pve-user at lists.proxmox.com
>>
>>
>> Hi,
>>
>> So I have figured out what likely happened.
>>
>> Indeed, it was very likely network congestion: proxmox1 and proxmox2
>> were using one switch and proxmox3 the other, because proxmox1 and
>> proxmox2 had not properly loaded the bond-primary directive (the
>> primary slave was not shown in /proc/net/bonding/bond0 although it was
>> present in /etc/network/interfaces).
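>>
>> To verify which primary (if any) the kernel actually applied, something
>> like this should do (a sketch; the exact /proc output format can vary
>> by kernel version):
>>
>> grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/bond0
>> # hypothetical output once the directive loads correctly:
>> # Primary Slave: ens2f1np1 (primary_reselect always)
>> # Currently Active Slave: ens2f1np1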
>>
>> Additionally, I have just checked that both switches are linked through
>> a 1G port, because the 4th SFP+ port is being used for the backup
>> server... (against my recommendation during the cluster setup, I must add...)
>>
>> So very likely it was network congestion that kicked proxmox1 out of
>> the cluster.
>>
>> It seems that bond directives should be present in the slave stanzas too, like:
>>
>> auto lo
>> iface lo inet loopback
>>
>> iface ens2f0np0 inet manual
>> bond-master bond0
>> bond-primary ens2f1np1
>> # Switch2
>>
>> iface ens2f1np1 inet manual
>> bond-master bond0
>> bond-primary ens2f1np1
>> # Switch1
>>
>> iface eno1 inet manual
>>
>> iface eno2 inet manual
>>
>> auto bond0
>> iface bond0 inet manual
>> bond-slaves ens2f0np0 ens2f1np1
>> bond-miimon 100
>> bond-mode active-backup
>> bond-primary ens2f1np1
>>
>> auto bond0.91
>> iface bond0.91 inet static
>> address 192.168.91.11
>> #Ceph
>>
>> auto vmbr0
>> iface vmbr0 inet static
>> address 192.168.90.11
>> gateway 192.168.90.1
>> bridge-ports bond0
>> bridge-stp off
>> bridge-fd 0
>>
>> Otherwise, it seems the primary sometimes doesn't get configured properly...
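>>
>> After fixing the file, something like this should reapply and confirm
>> the change (ifreload comes with ifupdown2; with classic ifupdown, an
>> ifdown/ifup cycle of bond0 would be the rough equivalent, but note that
>> cycling bond0 interrupts Ceph and VM traffic on that node):
>>
>> ifreload -a
>> grep 'Primary Slave' /proc/net/bonding/bond0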
>>
>> Thanks again Michael and Stefan!
>> Eneko
>>
>>
>> On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
>>> Hi Michael,
>>>
>>> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>>>> On Wed, 14 Apr 2021 11:04:10 +0200
>>>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>>>
>>>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>>>> operating normally for a year. Last update was on January 21st 2021.
>>>>> Storage is Ceph and nodes are connected to the same network switch
>>>>> with active-passive bonds.
>>>>>
>>>>> proxmox1 was fenced and automatically rebooted, then everything
>>>>> recovered. HA restarted VMs in other nodes too.
>>>>>
>>>>> proxmox1 syslog: (no network link issues reported at device level)
>>>> I have seen this occasionally, and every time the cause was high
>>>> network load/congestion which caused a token timeout. The default token
>>>> timeout in corosync is IMHO very optimistically configured at 1000 ms,
>>>> so I changed this setting to 5000 ms, and since then I have never seen
>>>> fencing caused by network load/congestion again. You could try this and
>>>> see if it helps you.
>>>>
>>>> PS. My cluster communication is on a dedicated Gbit bonded VLAN.
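>>>>
>>>> A minimal sketch of that change in /etc/pve/corosync.conf, assuming
>>>> the rest of the totem section stays as-is (bump config_version too,
>>>> so the change propagates across the cluster):
>>>>
>>>> totem {
>>>>   # ... existing cluster_name, interface { }, etc. unchanged
>>>>   config_version: <old value + 1>
>>>>   token: 5000
>>>>   version: 2
>>>> }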
>>> Thanks for the info. In this case the network is 10Gbit (I see I didn't
>>> include this info), but only for the proxmox nodes:
>>>
>>> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
>>> - Both switches are interconnected with an SFP+ DAC
>>> - Active-passive bonds in each proxmox node go to one SFP+ interface on
>>> each switch. Primary interfaces are configured to be on the same switch.
>>> - Connectivity to the LAN is done with 1 Gbit link
>>> - The Proxmox 2x10G bond is used for VM networking and the Ceph
>>> public/private networks.
>>>
>>> I wouldn't expect high network load/congestion because it's an
>>> internal LAN with 1Gbit clients. No Ceph issues or backfilling were
>>> occurring during the fence.
>>>
>>> Network cards are Broadcom.
>>>
>>> Thanks
>>
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/