[PVE-User] PVE 6.2 Strange cluster node fence

Eneko Lacunza elacunza at binovo.es
Wed Apr 14 16:07:09 CEST 2021


Hi Stefan,

Thanks for your advice. Seems a really good use for otherwise unused 1G 
ports so I'll look into configuring that.

If nodes had only one 1G interface, would you also use RRP? (one ring on 
1G and the other on 10G bond)

Thanks

On 14/4/21 at 15:57, Stefan M. Radman wrote:
> Hi Eneko
>
> That’s a nice setup and I bet it works well but you should do some 
> hand-tuning to increase resilience.
>
> Are the unused eno1 and eno2 interfaces on-board 1GbE copper interfaces?
>
> If that’s the case I’d strongly recommend turning them into dedicated 
> untagged interfaces for the cluster traffic, running on two separate 
> “rings”.
>
> https://pve.proxmox.com/wiki/Separate_Cluster_Network
> https://pve.proxmox.com/wiki/Separate_Cluster_Network#Redundant_Ring_Protocol
>
> Create two corosync rings, using isolated VLANs on your two switches 
> e.g. VLAN4001 on Switch1 and VLAN4002 on Switch2.
>
> eno1 => Switch1 => VLAN4001
> eno2 => Switch2 => VLAN4002
>
> Restrict VLAN4001 to the access ports where the eno1 interfaces are 
> connected. Prune VLAN4001 from ALL trunks.
> Restrict VLAN4002 to the access ports where the eno2 interfaces are 
> connected. Prune VLAN4002 from ALL trunks.
> Assign the eno1 and eno2 interfaces to two separate subnets and you 
> are done.
>
> With separate rings you don’t even have to stop your cluster while 
> migrating corosync to the new subnets.
> Just do them one-by-one.
>
> With corosync running on two separate rings isolated from the rest of 
> your network you should not see any further node fencing.
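>
> As a minimal sketch of the end state (the subnets 10.40.1.0/24 and 
> 10.40.2.0/24 are placeholders for VLAN4001 and VLAN4002, not values from 
> this thread, so substitute your own), /etc/pve/corosync.conf would gain 
> a ring1_addr per node and a second interface entry. config_version has 
> to be incremented on every edit of that file.
>
> ```
> # ring0: eno1 => Switch1 => VLAN4001 (placeholder subnet 10.40.1.0/24)
> # ring1: eno2 => Switch2 => VLAN4002 (placeholder subnet 10.40.2.0/24)
> nodelist {
>   node {
>     name: proxmox1
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: 10.40.1.11
>     ring1_addr: 10.40.2.11
>   }
>   node {
>     name: proxmox2
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: 10.40.1.12
>     ring1_addr: 10.40.2.12
>   }
>   node {
>     name: proxmox3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: 10.40.1.13
>     ring1_addr: 10.40.2.13
>   }
> }
>
> totem {
>   cluster_name: CLUSTERNAME
>   # must be higher than the value currently in the file
>   config_version: 4
>   interface {
>     linknumber: 0
>   }
>   interface {
>     linknumber: 1
>   }
>   ip_version: ipv4-6
>   secauth: on
>   version: 2
> }
> ```
>
> The quorum section stays as it is. Each node's eno1 and eno2 would need 
> a static address in the matching subnet in /etc/network/interfaces 
> before the ring addresses are changed.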
>
> Stefan
>
>> On Apr 14, 2021, at 15:18, Eneko Lacunza <elacunza at binovo.es> wrote:
>>
>> Hi Stefan,
>>
>> On 14/4/21 at 13:22, Stefan M. Radman wrote:
>>> Hi Eneko
>>>
>>> Do you have separate physical interfaces for the cluster (corosync) 
>>> traffic?
>> No.
>>> Do you have them on separate VLANs on your switches?
>> Only Ceph traffic is on VLAN91, the rest is untagged.
>>
>>> Are you running 1 or 2 corosync rings?
>> This is standard... no hand tuning:
>>
>> nodelist {
>>   node {
>>     name: proxmox1
>>     nodeid: 2
>>     quorum_votes: 1
>>     ring0_addr: 192.168.90.11
>>   }
>>   node {
>>     name: proxmox2
>>     nodeid: 1
>>     quorum_votes: 1
>>     ring0_addr: 192.168.90.12
>>   }
>>   node {
>>     name: proxmox3
>>     nodeid: 3
>>     quorum_votes: 1
>>     ring0_addr: 192.168.90.13
>>   }
>> }
>>
>> quorum {
>>   provider: corosync_votequorum
>> }
>>
>> totem {
>>   cluster_name: CLUSTERNAME
>>   config_version: 3
>>   interface {
>>     linknumber: 0
>>   }
>>   ip_version: ipv4-6
>>   secauth: on
>>   version: 2
>> }
>>
>>>
>>> Please post your /etc/network/interfaces and explain which interface 
>>> connects where.
>> auto lo
>> iface lo inet loopback
>>
>> iface ens2f0np0 inet manual
>> # Switch2
>>
>> iface ens2f1np1 inet manual
>> # Switch1
>>
>> iface eno1 inet manual
>>
>> iface eno2 inet manual
>>
>> auto bond0
>> iface bond0 inet manual
>>     bond-slaves ens2f0np0 ens2f1np1
>>     bond-miimon 100
>>     bond-mode active-backup
>>     bond-primary ens2f0np1
>>
>> auto bond0.91
>> iface bond0.91 inet static
>>     address 192.168.91.11
>> #Ceph
>>
>> auto vmbr0
>> iface vmbr0 inet static
>>     address 192.168.90.11
>>     gateway 192.168.90.1
>>     bridge-ports bond0
>>     bridge-stp off
>>     bridge-fd 0
>>
>> Thanks
>>>
>>> Thanks
>>>
>>> Stefan
>>>
>>>
>>>> On Apr 14, 2021, at 12:12, Eneko Lacunza via pve-user 
>>>> <pve-user at lists.proxmox.com> wrote:
>>>>
>>>>
>>>> From: Eneko Lacunza <elacunza at binovo.es>
>>>> Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
>>>> Date: April 14, 2021 at 12:12:09 GMT+2
>>>> To: pve-user at lists.proxmox.com
>>>>
>>>>
>>>> Hi Michael,
>>>>
>>>> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>>>>> On Wed, 14 Apr 2021 11:04:10 +0200
>>>>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>>>>
>>>>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>>>>> operating normally for a year. Last update was on January 21st 2021.
>>>>>> Storage is Ceph and nodes are connected to the same network switch
>>>>>> with active-passive bonds.
>>>>>>
>>>>>> proxmox1 was fenced and automatically rebooted, then everything
>>>>>> recovered. HA restarted VMs in other nodes too.
>>>>>>
>>>>>> proxmox1 syslog: (no network link issues reported at device level)
>>>>> I have seen this occasionally, and every time the cause was high network
>>>>> load/congestion that caused a token timeout. The default token timeout in
>>>>> corosync is IMHO very optimistically configured at 1000 ms, so I changed
>>>>> this setting to 5000 ms. Since then I have never seen fencing caused by
>>>>> network load/congestion again. You could try this and see if it helps.
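>>>>>
>>>>> As a rough sketch of that change, assuming the file is edited via
>>>>> /etc/pve/corosync.conf and config_version is incremented at the same
>>>>> time, the only addition is a token line in the totem section:
>>>>>
>>>>> ```
>>>>> totem {
>>>>>   # keep the existing totem settings and add:
>>>>>   # token timeout in milliseconds
>>>>>   token: 5000
>>>>> }
>>>>> ```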
>>>>>
>>>>> PS. my cluster communication is on a dedicated gb bonded vlan.
>>>> Thanks for the info. In this case network is 10Gbit (I see I didn't 
>>>> include this info) but only for proxmox nodes:
>>>>
>>>> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
>>>> - Both switches are interconnected with a SFP+ DAC
>>>> - Active-passive bonds in each proxmox node go to one SFP+ interface 
>>>> on each switch. Primary interfaces are configured to be on the same 
>>>> switch.
>>>> - Connectivity to the LAN is done with 1 Gbit link
>>>> - Proxmox 2x10G Bond is used for VM networking and Ceph 
>>>> public/private networks.
>>>>
>>>> I wouldn't expect high network load/congestion because it's on an 
>>>> internal LAN, with 1Gbit clients. No Ceph issues/backfilling were 
>>>> occurring during the fence.
>>>>
>>>> Network cards are Broadcom.
>>>>
>>>> Thanks
>>>>
>>>> Eneko Lacunza
>>>> Zuzendari teknikoa | Director técnico
>>>> Binovo IT Human Project
>>>>
>>>> Tel. +34 943 569 206 | https://www.binovo.es 
>>>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>>>>
>>>> https://www.youtube.com/user/CANALBINOVO 
>>>> https://www.linkedin.com/company/37269706/
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> pve-user mailing list
>>>> pve-user at lists.proxmox.com
>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>
>>>
>>> CONFIDENTIALITY NOTICE: /This communication may contain privileged 
>>> and confidential information, or may otherwise be protected from 
>>> disclosure, and is intended solely for use of the intended 
>>> recipient(s). If you are not the intended recipient of this 
>>> communication, please notify the sender that you have received this 
>>> communication in error and delete and destroy all copies in your 
>>> possession. /
>>>
>>
>
>

Eneko Lacunza
Director Técnico | Zuzendari teknikoa
Binovo IT Human Project

943 569 206
elacunza at binovo.es
binovo.es
Astigarragako Bidea, 2 - 2 izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO/
https://www.linkedin.com/company/37269706/


