[PVE-User] Cluster disaster
ADhaussy at voyages-sncf.com
Mon Nov 14 11:50:57 CET 2016
On 11/11/2016 at 19:43, Dietmar Maurer wrote:
> On November 11, 2016 at 6:41 PM Dhaussy Alexandre
> <ADhaussy at voyages-sncf.com> wrote:
>>> you lost quorum, and the watchdog expired - that is how the watchdog
>>> based fencing works.
>> I don't expect to lose quorum when _one_ node joins or leaves the cluster.
> This was probably a long time before - but I have not read through the whole
> logs ...
That makes no sense to me.
The fact is: everything has been working fine for weeks.
What I can see in the logs is several cluster nodes rebooting
suddenly, exactly one minute after one node joined and/or left the
cluster.
I see no problems with corosync/lrm/crm before that.
This leads me to suspect a network (multicast) malfunction.
I did a bit of homework, reading the wiki about the HA manager.
What I understand so far is that every state/service change from the LRM
must be acknowledged (cluster-wide) by the CRM master.
So if a multicast disruption occurs, I assume the LRM wouldn't be able
to talk to the CRM master, and then it also couldn't reset the watchdog.
Am I right?
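
To make the timing argument concrete, here is a toy model of watchdog-based
fencing. It is only a sketch and not ha-manager's actual code: the 60-second
timeout is an assumption taken from the one-minute delay seen in the logs,
and fence_time is a hypothetical helper name.

```python
WATCHDOG_TIMEOUT = 60  # seconds; assumed from the one-minute delay in the logs


def fence_time(quorate_by_second, timeout=WATCHDOG_TIMEOUT):
    """Return the second at which the watchdog would expire, or None.

    quorate_by_second[t] is True if the node has quorum at second t.
    In this toy model the watchdog is reset only while the node is quorate.
    """
    last_reset = 0
    for t, quorate in enumerate(quorate_by_second):
        if quorate:
            last_reset = t  # watchdog reset, countdown starts again
        elif t - last_reset >= timeout:
            return t  # no reset for `timeout` seconds: the node is fenced
    return None


if __name__ == "__main__":
    # Quorum is lost after 10 seconds and never comes back.
    timeline = [True] * 10 + [False] * 120
    print(fence_time(timeline))
```

In this model an outage shorter than the timeout is harmless, while a
multicast disruption lasting a full minute reboots every node that lost
quorum, which would match what the logs show.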
Another thing: I have checked my network configuration, and the cluster IP
is set on a Linux bridge.
By default multicast_snooping is set to 1 on a Linux bridge, so I think
there's a good chance this is the source of my problems.
Note that we don't use IGMP snooping, it is disabled on almost all
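
For anyone wanting to check their own nodes, a minimal sketch that reads the
bridge's multicast_snooping flag from sysfs. The bridge name "vmbr0" and the
helper name are assumptions for illustration; adjust them to your setup.

```python
from pathlib import Path


def multicast_snooping(bridge, sysfs_root="/sys/class/net"):
    """Return True/False for the bridge's multicast_snooping flag.

    Returns None if the interface does not exist or is not a bridge.
    """
    flag = Path(sysfs_root) / bridge / "bridge" / "multicast_snooping"
    try:
        return flag.read_text().strip() == "1"
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    # "vmbr0" is an assumed bridge name, not taken from my configuration.
    print(multicast_snooping("vmbr0"))
```

Writing 0 to the same file as root turns snooping off on that bridge; the
change is not persistent across reboots unless it is also put in the
network configuration.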
Plus, I found a post by A. Derumier (yes, 3 years old) who had
similar issues with bridges and multicast.