[PVE-User] PVE 6.2 Strange cluster node fence

Eneko Lacunza elacunza at binovo.es
Wed Apr 14 15:18:13 CEST 2021


Hi Stefan,

On 14/4/21 at 13:22, Stefan M. Radman wrote:
> Hi Eneko
>
> Do you have separate physical interfaces for the cluster (corosync) 
> traffic?
No.
> Do you have them on separate VLANs on your switches?
Only Ceph traffic is on VLAN91, the rest is untagged.

> Are you running 1 or 2 corosync rings?
This is standard... no hand tuning:

nodelist {
   node {
     name: proxmox1
     nodeid: 2
     quorum_votes: 1
     ring0_addr: 192.168.90.11
   }
   node {
     name: proxmox2
     nodeid: 1
     quorum_votes: 1
     ring0_addr: 192.168.90.12
   }
   node {
     name: proxmox3
     nodeid: 3
     quorum_votes: 1
     ring0_addr: 192.168.90.13
   }
}

quorum {
   provider: corosync_votequorum
}

totem {
   cluster_name: CLUSTERNAME
   config_version: 3
   interface {
     linknumber: 0
   }
   ip_version: ipv4-6
   secauth: on
   version: 2
}

>
> Please post your /etc/network/interfaces and explain which interface 
> connects where.
auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
# Switch2

iface ens2f1np1 inet manual
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
     bond-slaves ens2f0np0 ens2f1np1
     bond-miimon 100
     bond-mode active-backup
     bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
     address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
     address 192.168.90.11
     gateway 192.168.90.1
     bridge-ports bond0
     bridge-stp off
     bridge-fd 0
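
As a side note, the bond stanza above can be sanity-checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it inspects the text of the stanza, not the live bonding driver.

```python
def parse_bond(stanza: str) -> dict:
    """Collect the bond-* options of one ifupdown stanza into a dict."""
    opts = {}
    for line in stanza.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("bond-"):
            opts[parts[0]] = parts[1:]
    return opts


def primary_is_slave(stanza: str) -> bool:
    """True if bond-primary is unset or names one of the bond-slaves."""
    opts = parse_bond(stanza)
    primary = opts.get("bond-primary", [])
    slaves = opts.get("bond-slaves", [])
    return not primary or primary[0] in slaves


# The bond0 stanza exactly as posted above
BOND0 = """\
auto bond0
iface bond0 inet manual
    bond-slaves ens2f0np0 ens2f1np1
    bond-miimon 100
    bond-mode active-backup
    bond-primary ens2f0np1
"""

print(primary_is_slave(BOND0))  # False
```

For the stanza as posted this prints False: bond-primary names ens2f0np1, which is neither of the listed slaves (ens2f0np0, ens2f1np1). Whether that is a typo in this mail or in the live file cannot be told from here; /proc/net/bonding/bond0 on the node shows which slave is actually active.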

Thanks
>
> Thanks
>
> Stefan
>
>
>> On Apr 14, 2021, at 12:12, Eneko Lacunza via pve-user 
>> <pve-user at lists.proxmox.com> wrote:
>>
>>
>> From: Eneko Lacunza <elacunza at binovo.es>
>> Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
>> Date: April 14, 2021 at 12:12:09 GMT+2
>> To: pve-user at lists.proxmox.com
>>
>>
>> Hi Michael,
>>
>> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>>> On Wed, 14 Apr 2021 11:04:10 +0200
>>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>>
>>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>>> operating normally for a year. Last update was on January 21st 2021.
>>>> Storage is Ceph and nodes are connected to the same network switch
>>>> with active-passive bonds.
>>>>
>>>> proxmox1 was fenced and automatically rebooted, then everything
>>>> recovered. HA restarted VMs in other nodes too.
>>>>
>>>> proxmox1 syslog: (no network link issues reported at device level)
>>> I have seen this occasionally and every time the cause was high network
>>> load/network congestion which caused token timeout. The default token
>>> timeout in corosync IMHO is very optimistically configured to 1000 ms
>>> so I have changed this setting to 5000 ms and after I have done this I
>>> have never seen fencing happening caused by network load/network
>>> congestion again. You could try this and see if that helps you.
>>>
>>> PS. my cluster communication is on a dedicated gb bonded vlan.
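
For reference, the setting discussed above is the `token` option in the totem section of corosync.conf (for example `token: 5000`); on PVE the file to edit is /etc/pve/corosync.conf, and config_version has to be incremented with every change. The timeout corosync actually applies is larger than the configured value on clusters of three or more nodes. The sketch below works through that arithmetic, assuming a nodelist is present and token_coefficient is left at its documented default of 650 ms; the function name is made up for illustration.

```python
def effective_token_ms(token_ms: int, nodes: int, coefficient_ms: int = 650) -> int:
    """Token timeout corosync applies, following corosync.conf(5):
    token + (number_of_nodes - 2) * token_coefficient, for 3+ nodes."""
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * coefficient_ms


# A 3-node cluster with the two token values mentioned in this thread
for token in (1000, 5000):
    print(token, effective_token_ms(token, nodes=3))
# 1000 1650
# 5000 5650
```

So on a 3-node cluster a configured 1000 ms becomes 1650 ms, and 5000 ms becomes 5650 ms.
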
>> Thanks for the info. In this case network is 10Gbit (I see I didn't 
>> include this info) but only for proxmox nodes:
>>
>> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
>> - Both switches are interconnected with a SFP+ DAC
>> - Active-passive bonds in each proxmox node go to one SFP+ interface on 
>> each switch. Primary interfaces are configured to be on the same switch.
>> - Connectivity to the LAN is done with 1 Gbit link
>> - Proxmox 2x10G Bond is used for VM networking and Ceph 
>> public/private networks.
>>
>> I wouldn't expect high network load/congestion because it's on an 
>> internal LAN, with 1Gbit clients. No Ceph issues/backfilling were 
>> occurring during the fence.
>>
>> Network cards are Broadcom.
>>
>> Thanks
>>
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>>
>> Tel. +34 943 569 206 | https://www.binovo.es
>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>>
>> https://www.youtube.com/user/CANALBINOVO
>> https://www.linkedin.com/company/37269706/
>>
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/



