[PVE-User] Losing quorum - cluster broken
Nicolas Costes
nicolas.costes at univ-nantes.fr
Wed Apr 22 18:01:16 CEST 2015
Hi again,
I had a 3-node cluster set up and working fine. A couple of months ago, I
upgraded 2 of the 3 hosts (with no VMs on them) and the cluster broke.
I have reinstalled those 2 machines from scratch to set up a new cluster. It
worked fine for a couple of hours, until I ran "apt-get upgrade" on both nodes
and rebooted them: now the cluster stays up for about 5 minutes, then I get
this kind of message on both:
Apr 22 17:50:22 hongcha pvedaemon[101559]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 22 17:50:30 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2b9 2ba 2bb 2bc
[...]
Apr 22 17:51:24 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2de 2df 2d4 2da 2dc 2bd 2cd 2ce 2cf 2d0 2d5 2d6 2d7 2d8 2dd 2b9 2ba 2bb 2bc 2c1 2c2 2c3 2c4 2c9 2ca 2cb 2cc 2d1 2d2 2d3
Apr 22 17:51:24 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2d3 2c7 2c8 2b9 2ba 2bb 2bc 2c1 2c2 2c3 2c4 2c9 2ca 2cb 2cc 2d1 2d2 2d4 2d9 2da 2db 2dc
Then, a couple of minutes later:
Apr 22 17:55:19 hongcha corosync[102785]: [TOTEM ] A processor failed, forming new configuration.
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] CLM CONFIGURATION CHANGE
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] New Configuration:
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] #011r(0) ip(XXX2)
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] Members Left:
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] #011r(0) ip(XXX1)
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] Members Joined:
Apr 22 17:55:21 hongcha pmxcfs[102955]: [status] notice: node lost quorum
Apr 22 17:55:21 hongcha corosync[102785]: [CMAN ] quorum lost, blocking activity
Apr 22 17:55:21 hongcha corosync[102785]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
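The TOTEM retransmit lists above usually point at multicast packets being lost
between the nodes. For what it's worth, multicast connectivity can be checked
with omping (a sketch; it assumes the omping package is installed on both
nodes and the command is started on both at roughly the same time):

# omping -c 600 -i 1 -q yin hongcha

If the multicast loss climbs after a few minutes while unicast stays clean,
the problem is on the network side rather than in corosync itself.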
I can temporarily get the cluster up again with:
# service cman restart
# service pve-cluster restart
yin:~# pvecm nodes
Node Sts Inc Joined Name
1 M 1204 2015-04-22 17:45:58 yin
2 M 1216 2015-04-22 17:46:04 hongcha
hongcha:~# pvecm nodes
Node Sts Inc Joined Name
1 M 1216 2015-04-22 17:46:04 yin
2 M 1216 2015-04-22 17:46:04 hongcha
yin:~# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: XXXX
Cluster Id: 52565
Cluster Member: Yes
Cluster Generation: 1216
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: yin
Node ID: 1
Multicast addresses: 239.192.205.35
Node addresses: XXX1
hongcha:~# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: XXXX
Cluster Id: 52565
Cluster Member: Yes
Cluster Generation: 1216
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: hongcha
Node ID: 2
Multicast addresses: 239.192.205.35
Node addresses: XXX2
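As both status outputs show "Expected votes: 2", this 2-node cluster loses
quorum as soon as either node drops out. As a stopgap only (not a fix), the
expected vote count can be lowered on one node so pmxcfs becomes writable
again:

# pvecm expected 1

This should be reverted once the nodes see each other reliably again.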
Both nodes have their admin interface "vmbr0" bridged to "eth0" and plugged
into a VLAN-capable switch. IGMP snooping ("ip igmp snooping") is enabled
globally and on the relevant VLAN.
# show ip igmp snooping groups
Vlan  IP Address       Querier  Ports
----  --------------   -------  -------------
zzz   239.192.205.35   No       gi1/0/5
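Note that the "Querier" column reads "No": with IGMP snooping enabled but no
querier active on the VLAN, switches typically age the multicast group out
after a few minutes, which would match the ~5-minute failure pattern above.
On an IOS-style switch, a snooping querier could be enabled with something
like this (syntax assumed, adjust for the actual model):

# configure terminal
(config)# ip igmp snooping querier

Alternatively, disabling IGMP snooping on that VLAN only, as a test, would
show whether snooping is the culprit.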
Any idea why this update breaks the cluster in a reproducible way? How can I
fix this? Thanks in advance.
--
Nicolas Costes
IT infrastructure manager
IUT de la Roche-sur-Yon
Université de Nantes
Tel.: 02 51 47 40 29