[PVE-User] Cluster disaster

Dhaussy Alexandre ADhaussy at voyages-sncf.com
Tue Nov 22 17:35:08 CET 2016


...sequel to those thrilling adventures...
I _still_ have problems with nodes not joining the cluster properly after rebooting...

Here's what we did last night:

- Stopped ALL VMs (just to ensure no corruption happens in case of unexpected reboots...)
- Patched qemu from 2.6.1 to 2.6.2 to fix live migration issues.
- Removed the bridge (cluster network) on all nodes to fix multicast issues (11 nodes total).
- Patched all BIOSes and firmwares (HP blade / HP iLO / Ethernet / Fibre Channel cards) (13 nodes total).
- Rebooted all nodes, one, two, or three servers at a time.

So far we had absolutely no problems; corosync stayed quorate and all nodes left and rejoined the cluster successfully.

- Added 2 nodes to the cluster, no problem at all...
- Started two VMs on two nodes, then cut the network on those nodes.
- As expected, the watchdog did its job and killed the two nodes, and the VMs were relocated... so far so good!

_Except_, the two nodes were never able to join the cluster again after reboot...

LVM takes so long to scan all PVs/LVs... somehow, I believe, it ends up in an inconsistent state by the time systemd starts the cluster services.
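One way I want to double-check this is to compare boot-time unit durations and ordering once a node is back up; something along these lines should show whether the LVM scan really sits on the critical path before the cluster services (just a guess at the right check, unit names as on my nodes):

  systemd-analyze blame | head -n 20
  systemd-analyze critical-chain pve-cluster.service
  systemd-analyze critical-chain corosync.service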
On the other nodes, I can actually see that corosync does a quick join/leave (and fails) right after booting...

Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [TOTEM ] A new membership (10.98.x.x:1492) was formed. Members joined: 10
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [TOTEM ] A new membership (10.98.x.x:1496) was formed. Members left: 10
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [CPG   ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [QUORUM] Members[10]: 9 11 5 4 12 3 1 2 6 8
Nov 22 02:07:52 proxmoxt21 corosync[22342]:  [MAIN  ] Completed service synchronization, ready to provide service.

I tried several reboots...same problem. :(
I ended up removing the two freshly added nodes from the cluster, and restarted all VMs.

I don't know how, but I feel that every node I add to the cluster slows down the LVM scan a little more... until it ends up interfering with the cluster services at boot...
Recall that I have about 1500 VMs, 1600 LVs and 70 PVs on external SAN storage...
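A rough way to see whether the scan itself is the bottleneck (measured while the SAN is otherwise quiet, so the numbers mean something):

  time pvscan
  time vgscan
  time lvs > /dev/null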

_Now_ I have a serious lead that this issue could be related to a known race condition between udev and multipath.
I have had this issue before, but I didn't think it would interact with and cause issues for the cluster services... what do you think?
See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799781

I quickly tried the workaround suggested here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799781#32
(remove this rule from udev: ACTION=="add|change", SUBSYSTEM=="block", RUN+="/sbin/multipath -v0 /dev/$name")
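In practice that means commenting the rule out in the multipath udev rules file (it lives somewhere under /lib/udev/rules.d/, the exact file name may differ) and, I believe, rebuilding the initramfs if multipath-tools-boot is installed, so the change is also picked up at early boot; roughly:

  grep -rl 'multipath -v0' /lib/udev/rules.d/
  # comment out the ACTION=="add|change" ... RUN+="/sbin/multipath -v0 ..." line in that file
  udevadm control --reload-rules
  update-initramfs -u -k all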

I can tell it boots _much_ faster, but I will need to give it another try and do proper testing to see if it fixes my issue...
Anyhow, I'm open to suggestions or thoughts that could enlighten me...

(And sorry for the long story)

On 14/11/2016 at 12:33, Thomas Lamprecht wrote:


On 14.11.2016 11:50, Dhaussy Alexandre wrote:

On 11/11/2016 at 19:43, Dietmar Maurer wrote:
On November 11, 2016 at 6:41 PM Dhaussy Alexandre
<ADhaussy at voyages-sncf.com> wrote:
you lost quorum, and the watchdog expired - that is how the watchdog
based fencing works.
I don't expect to lose quorum when _one_ node joins or leaves the cluster.
This was probably a long time before - but I have not read through the whole
logs ...
That makes no sense to me...
The fact is: everything had been working fine for weeks.


What I can see in the logs is: several cluster nodes suddenly rebooting,
exactly one minute after one node joined and/or left the cluster.

The watchdog is set to a 60 second timeout, meaning that the cluster leave caused
quorum loss, or other problems (you said you had multicast problems around that
time), thus the LRM stopped updating the watchdog, so one minute later all nodes
which had left the quorate partition were reset.

I see no problems with corosync/lrm/crm before that.
This leads me to a probable network (multicast) malfunction.

I did a bit of homework reading the wiki about the HA manager...

What I understand so far is that every state/service change from the LRM
must be acknowledged (cluster-wide) by the CRM master.

Yes and no, LRM and CRM are two state machines with synced inputs,
but that holds mainly for human-triggered commands and the resulting
communication.
Meaning that commands like start, stop, migrate may not go through from
the CRM to the LRM. Fencing and such stuff works nonetheless, else it
would be a major design flaw :)

So if a multicast disruption occurs, and I assume the LRM wouldn't be able to
talk to the CRM master, then it also couldn't reset the watchdog, am I
right?



No, the watchdog runs on each node and is CRM independent.
As watchdogs are normally not able to serve more than one client, we wrote
the watchdog-mux (multiplexer).
This is a very simple C program which opens the watchdog with a
60 second timeout and allows multiple clients (at the moment CRM
and LRM) to connect to it.
If a client does not reset the dog for about 10 seconds, IIRC, the
watchdog-mux disables watchdog updates on the real watchdog.
After that, a node reset will happen *when* the dog runs out of time,
not instantly.

So if the LRM cannot communicate (i.e. has no quorum) it will stop
updating the dog, and thus trigger a reset independent of what the CRM says or does.
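If you want to watch this on a node, something like the following should show the mux and its connected clients, and who stopped updating when (service names from a current setup, adapt if yours differ):

  systemctl status watchdog-mux.service
  ss -xp | grep watchdog-mux
  journalctl -b -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm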


Another thing; I have checked my network configuration, the cluster IP
is set on a Linux bridge...
By default multicast_snooping is set to 1 on a Linux bridge, so I think
there's a good chance this is the source of my problems...
Note that we don't use IGMP snooping; it is disabled on almost all
network switches.


Yes, multicast snooping has to be configured (recommended) or else turned off on the switch.
That's stated in some wiki articles, various forum posts and our docs, here:
http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
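If you want to keep the cluster network on a bridge, you can also simply turn snooping off on the bridge itself; roughly like this (assuming the cluster bridge is called vmbr0, adjust the name):

  echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping

  # or persistently, in the vmbr0 stanza of /etc/network/interfaces:
  #   post-up echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping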

Hope that helps a bit with understanding. :)

cheers,
Thomas

Plus I found a post by A. Derumier (yes, 3 years old...). He had
similar issues with the bridge and multicast.
http://pve.proxmox.com/pipermail/pve-devel/2013-March/006678.html
_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


