[PVE-User] PVE 6.2 Strange cluster node fence

Eneko Lacunza elacunza at binovo.es
Thu Apr 15 09:55:42 CEST 2021


Hi Stefan,

On 14/4/21 at 19:28, Stefan M. Radman wrote:
> The redundant corosync rings would definitely have prevented the 
> fencing even in your scenario. 

Yes that's for sure ;)
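
For the record, adding a second ring would basically mean giving each node a 
second link address in /etc/pve/corosync.conf, roughly like this (the 
192.168.95.x addresses are just placeholders for whatever dedicated subnet 
we'd use):

nodelist {
  node {
    name: proxmox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.90.11
    ring1_addr: 192.168.95.11
  }
  ... (same for proxmox2 and proxmox3)
}

plus bumping config_version in the totem section so the change gets synced 
to all nodes.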
>
> As a final note, you should also consider replacing that 1GbE link 
> between the switches with an Nx1GbE bundle (LACP) for redundancy and 
> bandwidth reasons, or at least with 2 x 1GbE secured by spanning tree (RSTP).
I think we should interlink the switches with SFP+. Backups don't need 
that bandwidth, but the final say is not mine :(

Thanks a lot
Eneko


>
> Stefan
>
>> On Apr 14, 2021, at 18:26, Eneko Lacunza via pve-user 
>> <pve-user at lists.proxmox.com> wrote:
>>
>>
>> From: Eneko Lacunza <elacunza at binovo.es>
>> Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
>> Date: April 14, 2021 at 18:26:08 GMT+2
>> To: pve-user at lists.proxmox.com
>>
>>
>> Hi,
>>
>> So I have figured out what likely happened.
>>
>> Indeed, it was very likely network congestion: proxmox1 and proxmox2 
>> were using one switch and proxmox3 the other, because proxmox1 and 
>> proxmox2 had not properly loaded the bond-primary directive (the 
>> primary slave was not shown in /proc/net/bonding/bond0 although it was 
>> present in /etc/network/interfaces).
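>>
>> (A quick way to check that on each node is something like:
>>
>>     grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/bond0
>>
>> which should print both the configured primary and the slave that is 
>> actually active; on proxmox1 and proxmox2 the "Primary Slave" line was 
>> simply missing.)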
>>
>> Additionally, I just found out that both switches are linked by a 1G 
>> port, because the 4th SFP+ port is being used for the backup server... 
>> (against my recommendation during the cluster setup, I must add...)
>>
>> So very likely it was network congestion that kicked proxmox1 out of 
>> the cluster.
>>
>> It seems that the bond directives should be present in the slave stanzas too, like:
>>
>> auto lo
>> iface lo inet loopback
>>
>> iface ens2f0np0 inet manual
>>     bond-master bond0
>>     bond-primary ens2f0np1
>> # Switch2
>>
>> iface ens2f1np1 inet manual
>>     bond-master bond0
>>     bond-primary ens2f0np1
>> # Switch1
>>
>> iface eno1 inet manual
>>
>> iface eno2 inet manual
>>
>> auto bond0
>> iface bond0 inet manual
>>     bond-slaves ens2f0np0 ens2f1np1
>>     bond-miimon 100
>>     bond-mode active-backup
>>     bond-primary ens2f0np1
>>
>> auto bond0.91
>> iface bond0.91 inet static
>>     address 192.168.91.11
>> #Ceph
>>
>> auto vmbr0
>> iface vmbr0 inet static
>>     address 192.168.90.11
>>     gateway 192.168.90.1
>>     bridge-ports bond0
>>     bridge-stp off
>>     bridge-fd 0
>>
>> Otherwise, it seems the primary sometimes doesn't get configured properly...
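>>
>> (If the primary didn't get picked up at boot, it can apparently also be 
>> set at runtime via sysfs, roughly:
>>
>>     echo ens2f0np1 > /sys/class/net/bond0/bonding/primary
>>
>> and then re-checked in /proc/net/bonding/bond0 -- adjust the interface 
>> name to whatever the real primary slave is called.)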
>>
>> Thanks again Michael and Stefan!
>> Eneko
>>
>>
>> On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
>>> Hi Michael,
>>>
>>> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>>>> On Wed, 14 Apr 2021 11:04:10 +0200
>>>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>>>
>>>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>>>> operating normally for a year. Last update was on January 21st 2021.
>>>>> Storage is Ceph and nodes are connected to the same network switch
>>>>> with active-passive bonds.
>>>>>
>>>>> proxmox1 was fenced and automatically rebooted, then everything
>>>>> recovered. HA restarted VMs in other nodes too.
>>>>>
>>>>> proxmox1 syslog: (no network link issues reported at device level)
>>>> I have seen this occasionally, and every time the cause was high network
>>>> load/congestion which caused a token timeout. The default token
>>>> timeout in corosync is IMHO very optimistically configured at 1000 ms,
>>>> so I changed this setting to 5000 ms, and since then I have never
>>>> seen fencing caused by network load/congestion again. You could
>>>> try this and see if that helps you.
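>>>>
>>>> In corosync.conf that is just the token value in the totem section,
>>>> something along these lines (on PVE, edit /etc/pve/corosync.conf and
>>>> bump config_version so the change is propagated):
>>>>
>>>>     totem {
>>>>       ...
>>>>       token: 5000
>>>>     }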
>>>>
>>>> PS: my cluster communication is on a dedicated GbE bonded VLAN.
>>> Thanks for the info. In this case the network is 10Gbit (I see I didn't 
>>> include this info), but only for the Proxmox nodes:
>>>
>>> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
>>> - Both switches are interconnected with a SFP+ DAC
>>> - The active-passive bond in each Proxmox node goes to one SFP+ interface on 
>>> each switch. Primary interfaces are configured to be on the same switch.
>>> - Connectivity to the LAN is done with 1 Gbit link
>>> - The Proxmox 2x10G bond is used for VM networking and the Ceph 
>>> public/private networks.
>>>
>>> I wouldn't expect high network load/congestion because it's on an 
>>> internal LAN with 1Gbit clients. No Ceph issues or backfilling were 
>>> occurring during the fence.
>>>
>>> Network cards are Broadcom.
>>>
>>> Thanks
>>

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/



