[PVE-User] PVE 6.2 Strange cluster node fence
Eneko Lacunza
elacunza at binovo.es
Wed Apr 14 15:18:13 CEST 2021
Hi Stefan,
On 14/4/21 at 13:22, Stefan M. Radman wrote:
> Hi Eneko
>
> Do you have separate physical interfaces for the cluster (corosync)
> traffic?
No.
> Do you have them on separate VLANs on your switches?
Only Ceph traffic is on VLAN91, the rest is untagged.
> Are you running 1 or 2 corosync rings?
This is standard... no hand tuning:
nodelist {
  node {
    name: proxmox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.90.11
  }
  node {
    name: proxmox2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.90.12
  }
  node {
    name: proxmox3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.90.13
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: CLUSTERNAME
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
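For reference, a minimal sketch of how the 5000 ms token timeout suggested further down in this thread could be added to the totem section above. The token value and the incremented config_version are assumptions for illustration, not what this cluster is running:

```
totem {
  cluster_name: CLUSTERNAME
  # assumed: bumped from 3 to 4, the version must increase on every edit
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  # assumed value: token timeout in milliseconds, as suggested in this thread
  token: 5000
  version: 2
}
```

On Proxmox VE the file to edit is /etc/pve/corosync.conf, and config_version has to be increased so the change propagates to the other nodes.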
>
> Please post your /etc/network/interfaces and explain which interface
> connects where.
auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
# Switch2

iface ens2f1np1 inet manual
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves ens2f0np0 ens2f1np1
    bond-miimon 100
    bond-mode active-backup
    bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
    address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
    address 192.168.90.11
    gateway 192.168.90.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

Thanks
>
> Thanks
>
> Stefan
>
>
>> On Apr 14, 2021, at 12:12, Eneko Lacunza via pve-user
>> <pve-user at lists.proxmox.com> wrote:
>>
>>
>> From: Eneko Lacunza <elacunza at binovo.es>
>> Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
>> Date: April 14, 2021 at 12:12:09 GMT+2
>> To: pve-user at lists.proxmox.com
>>
>>
>> Hi Michael,
>>
>> On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
>>> On Wed, 14 Apr 2021 11:04:10 +0200
>>> Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Yesterday we had a strange fence happen in a PVE 6.2 cluster.
>>>>
>>>> Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
>>>> operating normally for a year. Last update was on January 21st 2021.
>>>> Storage is Ceph and nodes are connected to the same network switch
>>>> with active-passive bonds.
>>>>
>>>> proxmox1 was fenced and automatically rebooted, then everything
>>>> recovered. HA restarted VMs in other nodes too.
>>>>
>>>> proxmox1 syslog: (no network link issues reported at device level)
>>> I have seen this occasionally and every time the cause was high network
>>> load/network congestion which caused token timeout. The default token
>>> timeout in corosync IMHO is very optimistically configured to 1000 ms
>>> so I have changed this setting to 5000 ms and after I have done this I
>>> have never seen fencing happening caused by network load/network
>>> congestion again. You could try this and see if that helps you.
>>>
>>> PS. my cluster communication is on a dedicated gb bonded vlan.
>> Thanks for the info. In this case network is 10Gbit (I see I didn't
>> include this info) but only for proxmox nodes:
>>
>> - We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
>> - Both switches are interconnected with a SFP+ DAC
>> - Active-passive bonds in each proxmox node go to one SFP+ interface on
>> each switch. Primary interfaces are configured to be on the same switch.
>> - Connectivity to the LAN is done with 1 Gbit link
>> - Proxmox 2x10G Bond is used for VM networking and Ceph
>> public/private networks.
>>
>> I wouldn't expect high network load/congestion because it's on an
>> internal LAN, with 1Gbit clients. No Ceph issues/backfilling were
>> occurring during the fence.
>>
>> Network cards are Broadcom.
>>
>> Thanks
>>
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>>
>> Tel. +34 943 569 206 | https://www.binovo.es
>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>>
>> https://www.youtube.com/user/CANALBINOVO
>> https://www.linkedin.com/company/37269706/
>>
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/