<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi<br>
<br>
I have a cluster with 13 working nodes in a dedicated VLAN. It uses
four switches: a DELL 10G, a NetExtreme 10G and two Netgear 1G
switches for the nodes with 1G interfaces. We run the latest Proxmox
with all packages updated. There is one difference, though: three
nodes run the 2.6.32-34-pve kernel, because they had IPv6 issues with
the latest kernel (2.6.32-37-pve, which works on the other nodes).<br>
<br>
Everything works well until I try to add a new node. As soon as I do
that, the whole GUI breaks (KVM keeps working, luckily) and "all
hell breaks loose," as they say.<br>
<br>
So far we have ruled out the network card, since the problem occurs
with different network cards. We have ruled out the switches, because
all of them work fine before the node is added, and we have also
tried connecting the new node to a 10G switch through a 10G-to-1G
GBIC module.<br>
We have now ruled out the Fujitsu hardware as well, because an HP
machine breaks the cluster in the same way.<br>
<br>
IGMP snooping is disabled, and multicast works on both sides (tested
with ssmping).<br>
<br>
<b>clustat</b> shows that all nodes are online.<br>
<br>
<b>pvecm nodes</b> shows that everything is OK. All nodes have a
"Join" time and "M" in the Sts column. The "Inc" values differ, though.<br>
<br>
<b>tcpdump</b> shows:<br>
<blockquote type="cite">12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
<br>
12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
<br>
12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
<br>
12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
<br>
12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
<br>
12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
<br>
12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
<br>
12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
<br>
12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42
</blockquote>
<br>
Output from log files:<br>
<blockquote type="cite">Apr 09 11:25:26 corosync [QUORUM]
Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
</blockquote>
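<br>
To double-check the membership line above, here is a minimal Python
sketch that parses it and lists the node IDs that do not appear. The
helper names parse_members and missing_ids are my own, not part of
corosync or Proxmox, and I am only assuming that a gap in the IDs
belongs to a node that was deleted earlier.<br>
<pre>
```python
import re


def parse_members(line):
    """Return (declared_count, node_ids) from a 'Members[N]: ...' log line."""
    match = re.search(r"Members\[(\d+)\]:\s*([\d\s]+)", line)
    if match is None:
        raise ValueError("no Members[...] entry found in line")
    declared = int(match.group(1))
    node_ids = [int(token) for token in match.group(2).split()]
    return declared, node_ids


def missing_ids(node_ids):
    """Return the IDs between 1 and the highest listed ID that are absent."""
    present = set(node_ids)
    return [i for i in range(1, max(node_ids) + 1) if i not in present]


# The quorum line quoted above, as a single log record.
line = ("Apr 09 11:25:26 corosync [QUORUM] "
        "Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15")

count, ids = parse_members(line)
print("declared members:", count)
print("listed members:  ", len(ids))
print("missing node IDs:", missing_ids(ids))
```
</pre>
For the line above, the declared and listed counts agree and the only
gap is node ID 12.<br>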
<blockquote type="cite">Apr 9 11:30:27 zoperdaja pmxcfs[4273]:
[dcdb] notice: cpg_join retry 2960
<br>
Apr 9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 2970
<br>
Apr 9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 2980
<br>
Apr 9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 2990
<br>
Apr 9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 3000
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 3010
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
cpg_send_message failed: 9
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
cpg_send_message failed: 9
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
cpg_send_message failed: 9
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
cpg_send_message failed: 9
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
cpg_send_message failed: 9
<br>
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:
cpg_send_message failed: 9
<br>
Apr 9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 3020
<br>
Apr 9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 3030
<br>
Apr 9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 3040
<br>
Apr 9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join
retry 3050
</blockquote>
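<br>
To get an overview of a longer log, a small Python sketch like the one
below can count the two message types seen above. It assumes one log
record per line (unlike the wrapped quote in this mail); the function
name summarize and the result keys are hypothetical, and the regular
expressions are simply derived from the lines quoted above.<br>
<pre>
```python
import re
from collections import Counter

# Patterns taken from the pmxcfs lines quoted in this mail.
RETRY = re.compile(r"\[dcdb\] notice: cpg_join retry (\d+)")
SEND_FAIL = re.compile(r"\[status\] crit: cpg_send_message failed: (\d+)")


def summarize(lines):
    """Count cpg_join retries and cpg_send_message failures per error code."""
    retries = []
    failures = Counter()
    for line in lines:
        match = RETRY.search(line)
        if match:
            retries.append(int(match.group(1)))
            continue
        match = SEND_FAIL.search(line)
        if match:
            failures[int(match.group(1))] += 1
    return {
        "retry_lines": len(retries),
        "last_retry": max(retries) if retries else None,
        "send_failures": dict(failures),
    }


# Three records from the log above, joined back into single lines.
sample = [
    "Apr  9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000",
    "Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9",
    "Apr  9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020",
]

summary = summarize(sample)
print(summary)
```
</pre>
Run against the full syslog, this should show whether the retry counter
keeps climbing and which error codes cpg_send_message reports.<br>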
<br>
I have read that Proxmox is tested with 16 working nodes, but there
are reports of people running it with more than 16. Either way, I
should still have plenty of room, shouldn't I? Of course we have had
nodes that are no longer in the cluster (deleted), but I assume that
they don't count. :)<br>
<br>
Any ideas where to look next?<br>
<br>
All the best<br>
Sten<br>
</body>
</html>