<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hi<br>

    <br>

    I have a cluster with 13 working nodes in a dedicated VLAN, using 4

    switches: a DELL 10G, a NetExtreme 10G and 2x Netgear 1G for the nodes

    with 1G interfaces. We're running the latest Proxmox with all packages

    updated. There's one difference, though: 3 nodes use the 2.6.32-34-pve

    kernel, as they had IPv6 issues with the latest kernel (2.6.32-37-pve,

    which works on the other nodes).<br>

    <br>

    Everything works fine until I try to add a new node. As soon as

    I do that, the whole GUI breaks (KVM keeps working, luckily) and "all

    hell breaks loose," as they say.<br>

    <br>

    So, we have ruled out network card issues, as the problem occurs

    with different network cards. We have ruled out switch issues,

    because all switches work fine before this situation AND we have

    also tried a 10G->1G GBIC module to connect the new node to the

    10G switch.<br>

    We have now also ruled out the Fujitsu hardware entirely, because an HP

    machine breaks the cluster as well.<br>

    <br>

    IGMP snooping is disabled, multicast is working on both sides,

    tested with ssmping.<br>

    <br>

    <b>clustat</b> shows that all nodes are online.<br>

    <br>

    <b>pvecm nodes</b> shows that everything is OK. All nodes have a

    "join" time and "M" in the Sts column. "Inc" differs, though.<br>

    <br>

    <b>tcpdump</b> shows:<br>

    <blockquote type="cite">12:15:57.535798 IP 0.0.0.0 >

      all-systems.mcast.net: igmp query v2

      <br>

      12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1:

      HBH

      ICMP6, multicast listener querymax resp delay: 1000 addr: ::,

      length 24

      <br>

      12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query

      v2

      <br>

      12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH

      ICMP6,

      multicast listener querymax resp delay: 1000 addr: ::, length 24

      <br>

      12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query

      v2

      <br>

      12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6,

      multicast

      listener querymax resp delay: 1000 addr: ::, length 24

      <br>

      12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length

      107

      <br>

      12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617

      <br>

      12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui

      Unknown), length 42

    </blockquote>

    <br>

    Output from log files:<br>

    <blockquote type="cite">Apr 09 11:25:26 corosync [QUORUM]

      Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15

    </blockquote>
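    <br>
    As a sanity check on that member list, here is a minimal Python sketch that parses the QUORUM line and lists the node IDs skipped between 1 and the highest ID. The helper names parse_members and missing_ids are made up for illustration, and the regex assumes only the line format shown above. For this line it reports 14 declared members, 14 parsed IDs and one skipped ID (12).<br>
    <pre>
```python
import re


def parse_members(line):
    """Return (declared_count, ids) from a corosync '[QUORUM] Members[N]: ...' line."""
    m = re.search(r"Members\[(\d+)\]:\s*([\d\s]+)$", line)
    if m is None:
        raise ValueError("not a Members line")
    ids = [int(tok) for tok in m.group(2).split()]
    return int(m.group(1)), ids


def missing_ids(ids):
    """Node IDs absent from the contiguous range 1..max(ids)."""
    return sorted(set(range(1, max(ids) + 1)) - set(ids))


# The QUORUM line from the log excerpt, joined onto one line.
line = "Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15"
declared, ids = parse_members(line)
print(declared, len(ids), missing_ids(ids))
```
    </pre>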

    <blockquote type="cite">Apr  9 11:30:27 zoperdaja pmxcfs[4273]:

      [dcdb] notice: cpg_join retry 2960

      <br>

      Apr  9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 2970

      <br>

      Apr  9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 2980

      <br>

      Apr  9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 2990

      <br>

      Apr  9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 3000

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 3010

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:

      cpg_send_message failed: 9

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:

      cpg_send_message failed: 9

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:

      cpg_send_message failed: 9

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:

      cpg_send_message failed: 9

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:

      cpg_send_message failed: 9

      <br>

      Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit:

      cpg_send_message failed: 9

      <br>

      Apr  9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 3020

      <br>

      Apr  9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 3030

      <br>

      Apr  9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 3040

      <br>

      Apr  9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join

      retry 3050

    </blockquote>
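    <br>
    To get an overview of how often each message repeats, a rough Python sketch like the one below tallies the pmxcfs lines by section, level and message kind. The function name summarize_pmxcfs and the regex are my own assumptions, based only on the line format in the excerpt above; it expects each log entry on a single line, as in the original syslog file.<br>
    <pre>
```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above, e.g.
#   "Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9"
# Groups: 1 = section (dcdb/status), 2 = level (notice/crit), 3 = message text.
LINE_RE = re.compile(r"pmxcfs\[\d+\]:\s+\[(\w+)\]\s+(\w+):\s+(.*)$")


def summarize_pmxcfs(lines):
    """Count pmxcfs messages grouped by (section, level, message kind)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a pmxcfs line in the assumed format
        section, level, msg = m.group(1), m.group(2), m.group(3).strip()
        # Drop the trailing number (retry counter or error code) so that
        # "cpg_join retry 2960" and "cpg_join retry 2970" fall into one bucket.
        kind = re.sub(r"[:\s]*\d+$", "", msg)
        counts[(section, level, kind)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Apr  9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000",
        "Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010",
        "Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9",
    ]
    for key, n in summarize_pmxcfs(sample).items():
        print(key, n)
```
    </pre>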

    <br>

    I have read that Proxmox is tested with 16 working nodes, but there are

    reports of people running it with more than 16. In any case, I should

    still have some way to go before that? Of course, we have had nodes that

    are no longer in the cluster (deleted), but I assume they don't count. :)<br>

    <br>

    Any ideas where to look next?<br>

    <br>

    All the best<br>

    Sten<br>

  </body>

</html>