[PVE-User] Cluster disaster

Mon Nov 14 11:50:57 CET 2016

Le 11/11/2016 à 19:43, Dietmar Maurer a écrit :
> On November 11, 2016 at 6:41 PM Dhaussy Alexandre 
> <ADhaussy at voyages-sncf.com> wrote:
>>> you lost quorum, and the watchdog expired - that is how the watchdog
>>> based fencing works.
>> I don't expect to loose quorum when _one_ node joins or leave the cluster.
> This was probably a long time before - but I have not read through the whole
> logs ...
That makes no sense to me..
The fact is : everything have been working fine for weeks.

What i can see in the logs is : several reboots of cluster nodes 
suddently, and exactly one minute after one node joining and/or leaving 
the cluster.
I see no problems with corosync/lrm/crm before that.
This leads me to a probable network (multicast) malfunction.

I did a bit of homeworks reading the wiki about ha manager..

What i understand so far, is that every state/service change from LRM 
must be acknowledged (cluster-wise) by CRM master.
So if a multicast disruption occurs, and i assume LRM wouldn't be able 
talk to the CRM MASTER, then it also couldn't reset the watchdog, am i 
right ?

Another thing ; i have checked my network configuration, the cluster ip 
is set on a linux bridge...
By default multicast_snooping is set to 1 on linux bridge, so i think it 
there's a good chance this is the source of my problems...
Note that we don't use IGMP snooping, it is disabled on almost all 
network switchs.

Plus i found a post by A.Derumier (yes, 3 years old..) He did have 
similar issues with bridge and multicast.
http://pve.proxmox.com/pipermail/pve-devel/2013-March/006678.html