[PVE-User] Adding a cluster node breaks whole cluster

Sten Aus sten.aus at eenet.ee
Thu Apr 9 11:29:47 CEST 2015


Hi

I have a cluster with 13 working nodes in a dedicated VLAN. It uses 4
switches: a DELL 10G, a NetExtreme 10G and 2 Netgear 1G for the nodes with
1G interfaces. We're using the latest Proxmox with all packages updated.
There's one difference, though: 3 nodes use the 2.6.32-34-pve kernel, as
they had IPv6 issues with the latest kernel (2.6.32-37-pve, which runs on
the other nodes).

Everything works fine until I try to add a new node. As soon as I
do that, the whole GUI breaks (KVM keeps working, luckily) and "all hell
breaks loose," as they say.

So far we have ruled out the network card, as the problem occurs with
different network cards. We have ruled out the switches, because all
switches work fine before the node is added AND we have also tried a
10G->1G GBIC module to connect the new node to a 10G switch.
We have now ruled out the new node's Fujitsu hardware entirely as well,
because an HP machine also breaks the cluster.

IGMP snooping is disabled, multicast is working on both sides, tested 
with ssmping.

*clustat* shows that all nodes are online.

*pvecm nodes* shows that everything is OK. All nodes have a "join" time
and "M" in the Sts column. The "Inc" values differ, though.

*tcpdump* shows:
> 12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> 12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH 
> ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> 12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> 12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, 
> multicast listener querymax resp delay: 1000 addr: ::, length 24
> 12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> 12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast 
> listener querymax resp delay: 1000 addr: ::, length 24
> 12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
> 12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
> 12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown),
> length 42 

Output from log files:
> Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 
> 13 14 15 
> Apr  9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 2960
> Apr  9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 2970
> Apr  9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 2980
> Apr  9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 2990
> Apr  9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 3000
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 3010
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: 
> cpg_send_message failed: 9
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: 
> cpg_send_message failed: 9
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: 
> cpg_send_message failed: 9
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: 
> cpg_send_message failed: 9
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: 
> cpg_send_message failed: 9
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: 
> cpg_send_message failed: 9
> Apr  9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 3020
> Apr  9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 3030
> Apr  9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 3040
> Apr  9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 
> 3050 
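
To make excerpts like the one above easier to compare between nodes, here
is a minimal Python sketch that summarizes them. It is only an
illustration: the helper names (unwrap, summarize) and the SAMPLE string
are hypothetical, and the patterns assume the exact line formats quoted
above, including the "> " quote markers and the mail line wrapping.

```python
import re
from collections import Counter

# Patterns for the three kinds of messages quoted above (assumed formats).
MEMBERS_RE = re.compile(r"\[QUORUM\] Members\[(\d+)\]:((?:\s+\d+)+)")
RETRY_RE = re.compile(r"cpg_join retry\s+(\d+)")
SEND_FAIL_RE = re.compile(r"cpg_send_message failed:\s*(\d+)")


def unwrap(text):
    """Drop '>' quote markers and join mail-wrapped lines into one string."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            line = line[1:].strip()
        if line:
            parts.append(line)
    return " ".join(parts)


def summarize(text):
    """Return member IDs, gaps in the ID range, last retry count, error codes."""
    flat = unwrap(text)
    member_ids = []
    match = MEMBERS_RE.search(flat)
    if match:
        declared = int(match.group(1))
        # Keep only as many IDs as the Members[N] header declares.
        member_ids = [int(x) for x in match.group(2).split()][:declared]
    missing_ids = []
    if member_ids:
        full_range = range(min(member_ids), max(member_ids) + 1)
        missing_ids = sorted(set(full_range) - set(member_ids))
    retries = [int(n) for n in RETRY_RE.findall(flat)]
    failures = Counter(int(code) for code in SEND_FAIL_RE.findall(flat))
    return {
        "member_ids": member_ids,
        "missing_ids": missing_ids,
        "last_retry": retries[-1] if retries else None,
        "send_failures": dict(failures),
    }


# A shortened copy of the log excerpt above, used only as example input.
SAMPLE = """\
> Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11
> 13 14 15
> Apr  9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry
> 3040
> Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
> cpg_send_message failed: 9
> Apr  9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry
> 3050
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the shortened sample this reports 14 member IDs with ID 12 absent from
the 1-15 range. A gap like that may simply be left by a node that was
removed earlier, so on its own it does not indicate a fault.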

I have read that Proxmox tests with 16 working nodes, and there are
reports of people running it with more than 16. Either way, I should
still have plenty of room, right? Of course, we have had nodes that are
no longer in the cluster (deleted), but I assume that they don't count. :)

Any ideas where to look next?

All the best
Sten