[PVE-User] Losing quorum - cluster broken

Nicolas Costes nicolas.costes at univ-nantes.fr
Wed Apr 22 18:01:16 CEST 2015


Hi again,

I had a 3-node cluster set up and working fine. A couple of months ago, I
upgraded 2 of the 3 hosts (with no VMs on them) and the cluster broke.

I reinstalled those 2 machines from scratch to set up a new cluster. It
worked fine for a couple of hours, until I ran "apt-get upgrade" on both nodes
and rebooted them: now the cluster stays up for 5 minutes, then I get this
kind of message on both:

Apr 22 17:50:22 hongcha pvedaemon[101559]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 22 17:50:30 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2b9 2ba 2bb 2bc
[...]
Apr 22 17:51:24 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2de 2df 2d4 2da 2dc 2bd 2cd 2ce 2cf 2d0 2d5 2d6 2d7 2d8 2dd 2b9 2ba 2bb 2bc 2c1 2c2 2c3 2c4 2c9 2ca 2cb 2cc 2d1 2d2 2d3
Apr 22 17:51:24 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2d3 2c7 2c8 2b9 2ba 2bb 2bc 2c1 2c2 2c3 2c4 2c9 2ca 2cb 2cc 2d1 2d2 2d4 2d9 2da 2db 2dc

Then, a couple of minutes later:

Apr 22 17:55:19 hongcha corosync[102785]: [TOTEM ] A processor failed, forming new configuration.
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] CLM CONFIGURATION CHANGE
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] New Configuration:
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] #011r(0) ip(XXX2)
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] Members Left:
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] #011r(0) ip(XXX1)
Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] Members Joined:
Apr 22 17:55:21 hongcha pmxcfs[102955]: [status] notice: node lost quorum
Apr 22 17:55:21 hongcha corosync[102785]: [CMAN ] quorum lost, blocking activity
Apr 22 17:55:21 hongcha corosync[102785]: [QUORUM] This node is within the non-primary component and will NOT provide any services.



I can temporarily get the cluster up again with:

# service cman restart
# service pve-cluster restart

yin:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   1204   2015-04-22 17:45:58  yin
   2   M   1216   2015-04-22 17:46:04  hongcha

hongcha:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   1216   2015-04-22 17:46:04  yin
   2   M   1216   2015-04-22 17:46:04  hongcha

yin:~# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: XXXX
Cluster Id: 52565
Cluster Member: Yes
Cluster Generation: 1216
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: yin
Node ID: 1
Multicast addresses: 239.192.205.35 
Node addresses: XXX1

hongcha:~# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: XXXX
Cluster Id: 52565
Cluster Member: Yes
Cluster Generation: 1216
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: hongcha
Node ID: 2
Multicast addresses: 239.192.205.35 
Node addresses: XXX2
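
If I read the docs right, another stopgap while I debug would be to lower the
expected vote count so a single node keeps quorum. This only masks the problem,
and I am assuming "pvecm expected" behaves the same on this version:

# pvecm expected 1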


Both nodes have their admin interface "vmbr0" bridged to "eth0" and plugged into a 
VLAN-capable switch. "Ip igmp" (IGMP snooping) is enabled globally and on the relevant VLAN.

# show ip igmp snooping groups 

Vlan  IP Address       Querier  Ports
----  ---------------  -------  -------------
zzz   239.192.205.35   No       gi1/0/5
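
To check whether multicast between the two nodes simply stops working after a
few minutes (which would match the ~5-minute pattern above), I suppose I can
run omping on both nodes at the same time and let it run for about 10 minutes.
This is only a test idea, assuming the omping package is available; yin and
hongcha are the two node names:

# apt-get install omping
# omping -c 600 -i 1 -q yin hongcha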

Any idea why this update breaks the cluster in a reproducible way? How can I 
fix this? Thanks in advance.


-- 
Nicolas Costes
IT Asset Manager
IUT de la Roche-sur-Yon
Université de Nantes
Tel.: 02 51 47 40 29


