[PVE-User] Losing quorum - cluster broken

Sten Aus sten.aus at eenet.ee
Wed Apr 22 21:40:58 CEST 2015


Is your multicast working correctly?
As a matter of fact, try disabling IGMP snooping; does that solve
your problem?
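
If you want to check multicast between the nodes directly, something like
the following may help. It is only a sketch: it assumes the omping package
is installed on both nodes and that the host names yin and hongcha (taken
from your output) resolve. The wrapper name multicast_test is made up for
illustration.

```shell
# Sketch only: assumes omping is installed on both nodes
# (for example via "apt-get install omping").
# multicast_test is a hypothetical helper name, not part of any package.
multicast_test() {
    count=$1
    interval=$2
    shift 2
    if ! command -v omping >/dev/null 2>&1; then
        echo "omping not found" >&2
        return 127
    fi
    # -c: number of probes, -i: interval between probes, -q: summary only
    omping -c "$count" -i "$interval" -q "$@"
}

# Start this on BOTH nodes at about the same time
# (600 probes at an interval of 1, roughly ten minutes):
#   multicast_test 600 1 yin hongcha
```

A long run is probably more telling than a short one here: if multicast
loss only appears after a few minutes, that would fit your cluster staying
up for about 5 minutes, and it would point towards IGMP snooping on a VLAN
without a querier.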

What switches are you using? What network cards? I've had similar
problems, where adding or re-adding a cluster node broke quorum completely.
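
Besides the switch, the Linux bridge on each host has its own multicast
snooping setting, so it may be worth a look as well. The snippet below is a
rough sketch, assuming your bridge is called vmbr0 as in your mail. The
function name bridge_snooping_state is made up for illustration.

```shell
# Sketch only: vmbr0 is the bridge name from the original mail.
# bridge_snooping_state is a hypothetical helper name.
# Prints the multicast snooping state of a Linux bridge
# (1 = enabled, 0 = disabled), or "unknown" if the bridge is not found.
bridge_snooping_state() {
    f="/sys/class/net/$1/bridge/multicast_snooping"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

echo "vmbr0 multicast_snooping: $(bridge_snooping_state vmbr0)"

# To switch snooping off on the bridge for a test (as root, not persistent):
#   echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping
```

This only covers the host side; snooping on the switch itself has to be
changed in the switch configuration.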

On 22.04.15 19:01, Nicolas Costes wrote:
> Hi again,
>
> I had a 3-node cluster setup and working fine. A couple of months ago, I
> upgraded 2 of the 3 hosts (with no VM on them) and the cluster broke.
>
> I have reinstalled those 2 machines from scratch to set up a new cluster. It
> worked fine for a couple of hours until I ran "apt-get upgrade" on both nodes
> and rebooted them: now the cluster stays up for 5 minutes, then I get this
> kind of message on both:
>
> Apr 22 17:50:22 hongcha pvedaemon[101559]: ipcc_send_rec failed: Transport
> endpoint is not connected
> Apr 22 17:50:30 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2b9 2ba
> 2bb 2bc
> [...]
> Apr 22 17:51:24 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2de 2df
> 2d4 2da 2dc 2bd 2cd 2ce 2cf 2d0 2d5 2d6 2d7 2d8 2dd 2b9 2ba 2bb 2bc 2c1 2c2
> 2c3 2c4 2c9 2ca 2cb 2cc 2d1 2d2 2d3
> Apr 22 17:51:24 hongcha corosync[102785]: [TOTEM ] Retransmit List: 2d3 2c7
> 2c8 2b9 2ba 2bb 2bc 2c1 2c2 2c3 2c4 2c9 2ca 2cb 2cc 2d1 2d2 2d4 2d9 2da 2db
> 2dc
>
> Then, a couple of minutes later,
>
> Apr 22 17:55:19 hongcha corosync[102785]: [TOTEM ] A processor failed, forming
> new configuration.
> Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] CLM CONFIGURATION CHANGE
> Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] New Configuration:
> Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] #011r(0) ip(XXX2)
> Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] Members Left:
> Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] #011r(0) ip(XXX1)
> Apr 22 17:55:21 hongcha corosync[102785]: [CLM ] Members Joined:
> Apr 22 17:55:21 hongcha pmxcfs[102955]: [status] notice: node lost quorum
> Apr 22 17:55:21 hongcha corosync[102785]: [CMAN ] quorum lost, blocking
> activity
> Apr 22 17:55:21 hongcha corosync[102785]: [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
>
>
>
> I can temporarily get the cluster up again with:
>
> # service cman restart
> # service pve-cluster restart
>
> yin:~# pvecm nodes
> Node  Sts   Inc   Joined               Name
>     1   M   1204   2015-04-22 17:45:58  yin
>     2   M   1216   2015-04-22 17:46:04  hongcha
>
> hongcha:~# pvecm nodes
> Node  Sts   Inc   Joined               Name
>     1   M   1216   2015-04-22 17:46:04  yin
>     2   M   1216   2015-04-22 17:46:04  hongcha
>
> yin:~# pvecm status
> Version: 6.2.0
> Config Version: 2
> Cluster Name: XXXX
> Cluster Id: 52565
> Cluster Member: Yes
> Cluster Generation: 1216
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 2
> Total votes: 2
> Node votes: 1
> Quorum: 2
> Active subsystems: 5
> Flags:
> Ports Bound: 0
> Node name: yin
> Node ID: 1
> Multicast addresses: 239.192.205.35
> Node addresses: XXX1
>
> hongcha:~# pvecm status
> Version: 6.2.0
> Config Version: 2
> Cluster Name: XXXX
> Cluster Id: 52565
> Cluster Member: Yes
> Cluster Generation: 1216
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 2
> Total votes: 2
> Node votes: 1
> Quorum: 2
> Active subsystems: 5
> Flags:
> Ports Bound: 0
> Node name: hongcha
> Node ID: 2
> Multicast addresses: 239.192.205.35
> Node addresses: XXX2
>
>
> Both nodes have their admin interface "vmbr0" tied to "eth0" and plugged into a
> vlan-capable switch. "Ip igmp" is activated globally and on the relevant vlan.
>
> # show ip igmp snooping groups
>
> Vlan  IP Address  Querier      Ports
> ---- ------------ -------- -------------
> zzz  239.192.205. No       gi1/0/5
>       35
>
> Any idea why this update breaks the cluster in a reproducible way? How can I
> fix this? Thanks in advance.
>
>

