[PVE-User] Adding a cluster node breaks whole cluster
Dietmar Maurer
dietmar at proxmox.com
Thu Apr 9 12:54:35 CEST 2015
> Everything is working well until I try to add a new node. As soon as I
> do that, the whole GUI breaks (KVM keeps working, luckily) and "all hell
> breaks loose," as they say.
>
> So, we have eliminated network card issues, as this problem occurs with
> different network cards. We have eliminated switch issues, because all
> switches were working before this situation AND we have also tried a
> 10G->1G GBIC module to connect this new node to the 10G switch.
> Now we have ruled out the Fujitsu hardware entirely, because an HP
> machine also breaks the cluster.
>
> IGMP snooping is disabled, multicast is working on both sides, tested
> with ssmping.
>
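For reference, a quick way to cross-check multicast across all nodes at the same time is omping (the hostnames below are placeholders, and omping has to be installed and started on every listed node in parallel):

# omping -c 600 -i 1 -q node1 node2 node3 newnode

If that runs cleanly for its roughly ten minutes, sustained multicast (including IGMP querier behaviour) is unlikely to be the problem.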
> *clustat* shows that all nodes are online.
>
> *pvecm nodes* shows that everything is OK. All nodes have a "join" time
> and "M" in the Sts column. "Inc" differs, though.
>
> *tcpdump* shows:
> > 12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> > 12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> > 12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> > 12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> > 12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> > 12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> > 12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
> > 12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
> > 12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42
>
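In case it helps with comparing captures, the corosync and IGMP traffic above can be isolated with a filter roughly like this (vmbr0 is just a placeholder for the cluster-facing interface):

# tcpdump -ni vmbr0 'udp port 5404 or udp port 5405 or igmp'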
> Output from log files:
> > Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
> > Apr 9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2960
> > Apr 9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2970
> > Apr 9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2980
> > Apr 9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2990
> > Apr 9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020
> > Apr 9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3030
> > Apr 9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3040
> > Apr 9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3050
>
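When pmxcfs is stuck in cpg_join like that, it usually also shows up as /etc/pve becoming read-only on the affected node. A quick sanity check, just as a sketch:

# mount | grep /etc/pve
# touch /etc/pve/test-write && rm /etc/pve/test-write

If the touch fails, pmxcfs has lost cluster membership even though clustat still looks fine.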
> I have read that Proxmox tests with 16 working nodes, but there are
> reports that some people run more than that. Either way, I should still
> have some headroom, right? Of course we have had nodes which are no
> longer in the cluster (deleted), but I assume they don't count. :)
>
> Any ideas where to look next?
Does it help if you restart the pve-cluster service on those nodes?
# service pve-cluster restart
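If you want to try that on all affected nodes in one go, something like this works (node names are placeholders; it assumes root ssh access from wherever you run it):

# for n in node1 node2 node3; do ssh root@$n 'service pve-cluster restart'; done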