[PVE-User] No cluster with Proxmox 2.2
Simone Piccardi
piccardi at truelite.it
Thu Dec 6 15:47:56 CET 2012
Hi,
this is my second attempt to create a cluster, after some failed ones.
It took me some time because I could no longer use all 4 blades and had
to start using the new hardware. So now a single blade is running
standalone with some VMs, and I'm using the other 3 blades to test the
cluster. I carefully removed and purged all packages and cleaned up all
directories left over from the previous installation:
rm -fR /etc/cluster/ /var/log/cluster /var/lib/cluster /etc/pve/ \
/usr/share/fence /var/lib/pve-manager /var/lib/pve-cluster/
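In case it matters, a further check that should catch leftover config
files from half-removed packages is to list anything still in the "rc"
(removed, config-files) dpkg state, roughly:

root@lama9:~# dpkg -l | awk '/^rc/ { print $2 }'

whatever shows up there can then be purged with dpkg --purge.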
My network config is the following:
root@lama9:~# ifconfig
bond0 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
inet6 addr: fe80::67d:7bff:fef1:3928/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:44813502 errors:164 dropped:0 overruns:0 frame:0
TX packets:165133563 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:20885997862 (19.4 GiB) TX bytes:223670282817 (208.3 GiB)
bond1 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:2a
inet addr:172.16.25.109 Bcast:172.16.25.255 Mask:255.255.255.0
inet6 addr: fe80::67d:7bff:fef1:392a/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:5196211 errors:0 dropped:0 overruns:0 frame:0
TX packets:241691 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:582008767 (555.0 MiB) TX bytes:147799002 (140.9 MiB)
eth0 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:42441219 errors:79 dropped:0 overruns:0 frame:0
TX packets:165133563 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20720269118 (19.2 GiB) TX bytes:223670282817 (208.3 GiB)
eth1 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2372283 errors:85 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:165728744 (158.0 MiB) TX bytes:0 (0.0 B)
eth2 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:2a
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2823929 errors:0 dropped:0 overruns:0 frame:0
TX packets:241691 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:416280083 (396.9 MiB) TX bytes:147799002 (140.9 MiB)
eth3 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:2a
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2372282 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:165728684 (158.0 MiB) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:368743 errors:0 dropped:0 overruns:0 frame:0
TX packets:368743 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:61854490 (58.9 MiB) TX bytes:61854490 (58.9 MiB)
venet0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet6 addr: fe80::1/128 Scope:Link
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:3 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
vmbr0 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
inet addr:192.168.250.109 Bcast:192.168.251.255 Mask:255.255.254.0
inet6 addr: fe80::67d:7bff:fef1:3928/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2984626 errors:0 dropped:0 overruns:0 frame:0
TX packets:395532 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:236789773 (225.8 MiB) TX bytes:109248994 (104.1 MiB)
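For completeness, the relevant stanzas in /etc/network/interfaces look
roughly like this (typed from memory, so take the bond_mode lines as
placeholders rather than the exact values in use):

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode active-backup

auto bond1
iface bond1 inet static
        address 172.16.25.109
        netmask 255.255.255.0
        slaves eth2 eth3
        bond_miimon 100
        bond_mode active-backup

auto vmbr0
iface vmbr0 inet static
        address 192.168.250.109
        netmask 255.255.254.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0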
I'm using bond1 and 172.16.25.0/24 for multicast, and it seems to work:
root@lama9:~# asmping 239.192.236.33 172.16.25.102
asmping joined (S,G) = (*,239.192.236.234)
pinging 172.16.25.102 from 172.16.25.109
multicast from 172.16.25.102, seq=1 dist=0 time=0.237 ms
unicast from 172.16.25.102, seq=1 dist=0 time=0.857 ms
unicast from 172.16.25.102, seq=2 dist=0 time=0.193 ms
multicast from 172.16.25.102, seq=2 dist=0 time=0.220 ms
unicast from 172.16.25.102, seq=3 dist=0 time=0.203 ms
multicast from 172.16.25.102, seq=3 dist=0 time=0.231 ms
and:
root@lama2:~# ssmpingd
received request from 172.16.25.109
received request from 172.16.25.109
received request from 172.16.25.109
received request from 172.16.25.109
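The asmping run above only lasts a few seconds, so as a further check I
plan to leave a longer multicast test running between the two nodes,
something along the lines of (omping started on both hosts at the same
time; not run yet):

root@lama9:~# omping -c 600 -i 1 -q 172.16.25.102 172.16.25.109

which should show whether multicast keeps flowing after the first few
seconds.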
I started creating the cluster:
root@lama2:~# pvecm create Cluster
Restarting pve cluster filesystem: pve-cluster[dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]
root@lama2:~#
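Right after the create, the quorum state on lama2 can be checked with the
usual command (I can post the full output if it helps):

root@lama2:~# pvecm status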
When I added the second node I got:
root@lama9:~# pvecm add 172.16.25.102
root@172.16.25.102's password:
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... fence_node: cannot connect to cman [FAILED]
waiting for quorum...
and it is still waiting there after 15 minutes...
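While it sits there, this is what I would run on lama9 to see the
membership state from cman's point of view (again, I can post the output
if useful):

root@lama9:~# cman_tool status
root@lama9:~# cman_tool nodes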
The logs on the new node are filled with the following:
Dec 6 15:37:20 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec 6 15:37:20 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec 6 15:37:20 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:20 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:26 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec 6 15:37:26 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec 6 15:37:26 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:26 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
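If I read the corosync headers correctly, error 6 here is
CS_ERR_TRY_AGAIN and error 9 is CS_ERR_BAD_HANDLE, i.e. pmxcfs never gets
a working connection to corosync. The values can be double-checked with
(assuming the corosync development headers are installed):

root@lama9:~# grep -n 'CS_ERR_TRY_AGAIN\|CS_ERR_BAD_HANDLE' /usr/include/corosync/corotypes.h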
On the original node, meanwhile, I got the following:
Dec 6 15:21:54 lama2 corosync[35840]: [QUORUM] Members[1]: 1
Dec 6 15:21:54 lama2 pmxcfs[35761]: [status] notice: update cluster info (cluster name SiwebCluster, version = 2)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.109)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.109)
Dec 6 15:21:57 lama2 corosync[35840]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 6 15:21:57 lama2 corosync[35840]: [QUORUM] Members[2]: 1 2
Dec 6 15:21:57 lama2 corosync[35840]: [QUORUM] Members[2]: 1 2
Dec 6 15:21:57 lama2 corosync[35840]: [QUORUM] Members[2]: 1 2
Dec 6 15:21:57 lama2 corosync[35840]: [CPG ] chosen downlist: sender r(0) ip(172.16.25.102) ; members(old:1 left:0)
Dec 6 15:21:57 lama2 corosync[35840]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 6 15:22:11 lama2 corosync[35840]: [TOTEM ] A processor failed, forming new configuration.
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.109)
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:22:13 lama2 corosync[35840]: [CMAN ] quorum lost, blocking activity
Dec 6 15:22:13 lama2 corosync[35840]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 6 15:22:13 lama2 pmxcfs[35761]: [status] notice: node lost quorum
Dec 6 15:22:13 lama2 corosync[35840]: [QUORUM] Members[1]: 1
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:22:13 lama2 corosync[35840]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 6 15:22:13 lama2 kernel: dlm: closing connection to node 2
Dec 6 15:22:13 lama2 corosync[35840]: [CPG ] chosen downlist: sender r(0) ip(172.16.25.102) ; members(old:2 left:1)
Dec 6 15:22:13 lama2 corosync[35840]: [MAIN ] Completed service synchronization, ready to provide service.
So far I have not been able to get a working cluster.
Is there something wrong with using a separate, isolated network to build
the cluster?
Simone
--
Simone Piccardi Truelite Srl
piccardi at truelite.it (email/jabber) Via Monferrato, 6
Tel. +39-347-1032433 50142 Firenze
http://www.truelite.it Tel. +39-055-7879597 Fax. +39-055-7333336