[PVE-User] Adding a cluster node breaks whole cluster
Dietmar Maurer
dietmar at proxmox.com
Thu Apr 9 12:54:35 CEST 2015
> Everything is working well until I try to add a new node. As soon as I
> do that, the whole GUI breaks (KVM keeps working, luckily) and "all hell
> breaks loose," as they say.
>
> So, we have eliminated network card issues, as this problem occurs with
> different network cards. We have eliminated switch issues, because all
> switches were working before this situation AND we have also tried a
> 10G->1G GBIC module to connect this new node to the 10G switch.
> Now we have ruled out the Fujitsu hardware entirely, because an HP
> machine also breaks the cluster.
>
> IGMP snooping is disabled, multicast is working on both sides, tested
> with ssmping.
>
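For reference, a quick way to cross-check multicast across all nodes at the same time is omping (the hostnames below are placeholders, and omping has to be installed and started on every listed node in parallel):

# omping -c 600 -i 1 -q node1 node2 node3 newnode

If that runs cleanly for its roughly ten minutes, sustained multicast (including IGMP querier behaviour) is unlikely to be the problem.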
> *clustat* shows that all nodes are online.
>
> *pvecm nodes* shows that everything is OK. All nodes have a "join" time
> and "M" in the Sts column. "Inc" differs, though.
>
> *tcpdump* shows:
> > 12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> > 12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> > 12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> > 12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> > 12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
> > 12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
> > 12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
> > 12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
> > 12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42
>
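In case it helps with comparing captures, the corosync and IGMP traffic above can be isolated with a filter roughly like this (vmbr0 is just a placeholder for the cluster-facing interface):

# tcpdump -ni vmbr0 'udp port 5404 or udp port 5405 or igmp'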
> Output from log files:
> > Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
> > Apr 9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2960
> > Apr 9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2970
> > Apr 9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2980
> > Apr 9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2990
> > Apr 9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
> > Apr 9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020
> > Apr 9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3030
> > Apr 9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3040
> > Apr 9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3050
>
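When pmxcfs is stuck in cpg_join like that, it usually also shows up as /etc/pve becoming read-only on the affected node. A quick sanity check, just as a sketch:

# mount | grep /etc/pve
# touch /etc/pve/test-write && rm /etc/pve/test-write

If the touch fails, pmxcfs has lost cluster membership even though clustat still looks fine.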
> I have read that Proxmox tests with 16 working nodes, but there are
> reports that some people run more than that. Either way, I should still
> have some headroom, right? Of course we have had nodes which are no
> longer in the cluster (deleted), but I assume they don't count. :)
>
> Any ideas where to look next?
Does it help if you restart the pve-cluster service on those nodes?
# service pve-cluster restart
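If you want to try that on all affected nodes in one go, something like this works (node names are placeholders; it assumes root ssh access from wherever you run it):

# for n in node1 node2 node3; do ssh root@$n 'service pve-cluster restart'; done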