[PVE-User] No cluster with Proxmox 2.2
Simone Piccardi
piccardi at truelite.it
Thu Dec 6 15:47:56 CET 2012
Hi,
this is my second attempt to create a cluster, after some failed ones.
It took me some time because I could no longer use all 4 blades and had
to start using the new hardware. So now a single blade is running
standalone with some VMs, and I'm using the other 3 blades to test the
cluster. I carefully removed and purged all packages and cleaned up all
directories left over from the previous installation:
rm -fR /etc/cluster/ /var/log/cluster /var/lib/cluster /etc/pve/ \
/usr/share/fence /var/lib/pve-manager /var/lib/pve-cluster/
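In case it matters, a further check that should catch leftover config
files from half-removed packages is to list anything still in the "rc"
(removed, config-files) dpkg state, roughly:

root@lama9:~# dpkg -l | awk '/^rc/ { print $2 }'

whatever shows up there can then be purged with dpkg --purge.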
My network config is the following:
root@lama9:~# ifconfig
bond0 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
inet6 addr: fe80::67d:7bff:fef1:3928/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:44813502 errors:164 dropped:0 overruns:0 frame:0
TX packets:165133563 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:20885997862 (19.4 GiB) TX bytes:223670282817 (208.3 GiB)
bond1 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:2a
inet addr:172.16.25.109 Bcast:172.16.25.255 Mask:255.255.255.0
inet6 addr: fe80::67d:7bff:fef1:392a/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:5196211 errors:0 dropped:0 overruns:0 frame:0
TX packets:241691 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:582008767 (555.0 MiB) TX bytes:147799002 (140.9 MiB)
eth0 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:42441219 errors:79 dropped:0 overruns:0 frame:0
TX packets:165133563 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20720269118 (19.2 GiB) TX bytes:223670282817 (208.3 GiB)
eth1 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2372283 errors:85 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:165728744 (158.0 MiB) TX bytes:0 (0.0 B)
eth2 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:2a
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2823929 errors:0 dropped:0 overruns:0 frame:0
TX packets:241691 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:416280083 (396.9 MiB) TX bytes:147799002 (140.9 MiB)
eth3 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:2a
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2372282 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:165728684 (158.0 MiB) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:368743 errors:0 dropped:0 overruns:0 frame:0
TX packets:368743 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:61854490 (58.9 MiB) TX bytes:61854490 (58.9 MiB)
venet0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet6 addr: fe80::1/128 Scope:Link
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:3 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
vmbr0 Link encap:Ethernet HWaddr 04:7d:7b:f1:39:28
inet addr:192.168.250.109 Bcast:192.168.251.255 Mask:255.255.254.0
inet6 addr: fe80::67d:7bff:fef1:3928/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2984626 errors:0 dropped:0 overruns:0 frame:0
TX packets:395532 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:236789773 (225.8 MiB) TX bytes:109248994 (104.1 MiB)
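For completeness, the relevant stanzas in /etc/network/interfaces look
roughly like this (typed from memory, so take the bond_mode lines as
placeholders rather than the exact values in use):

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode active-backup

auto bond1
iface bond1 inet static
        address 172.16.25.109
        netmask 255.255.255.0
        slaves eth2 eth3
        bond_miimon 100
        bond_mode active-backup

auto vmbr0
iface vmbr0 inet static
        address 192.168.250.109
        netmask 255.255.254.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0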
I'm using bond1 and 172.16.25.0/24 for multicast, and it seems to work:
root@lama9:~# asmping 239.192.236.33 172.16.25.102
asmping joined (S,G) = (*,239.192.236.234)
pinging 172.16.25.102 from 172.16.25.109
multicast from 172.16.25.102, seq=1 dist=0 time=0.237 ms
unicast from 172.16.25.102, seq=1 dist=0 time=0.857 ms
unicast from 172.16.25.102, seq=2 dist=0 time=0.193 ms
multicast from 172.16.25.102, seq=2 dist=0 time=0.220 ms
unicast from 172.16.25.102, seq=3 dist=0 time=0.203 ms
multicast from 172.16.25.102, seq=3 dist=0 time=0.231 ms
and:
root@lama2:~# ssmpingd
received request from 172.16.25.109
received request from 172.16.25.109
received request from 172.16.25.109
received request from 172.16.25.109
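The asmping run above only lasts a few seconds, so as a further check I
plan to leave a longer multicast test running between the two nodes,
something along the lines of (omping started on both hosts at the same
time; not run yet):

root@lama9:~# omping -c 600 -i 1 -q 172.16.25.102 172.16.25.109

which should show whether multicast keeps flowing after the first few
seconds.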
I started creating the cluster:
root@lama2:~# pvecm create Cluster
Restarting pve cluster filesystem: pve-cluster[dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]
root@lama2:~#
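Right after the create, the quorum state on lama2 can be checked with the
usual command (I can post the full output if it helps):

root@lama2:~# pvecm status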
When I added the second node I got:
root@lama9:~# pvecm add 172.16.25.102
root@172.16.25.102's password:
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... fence_node: cannot connect to cman [FAILED]
waiting for quorum...
and it is still waiting there after 15 minutes...
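While it sits there, this is what I would run on lama9 to see the
membership state from cman's point of view (again, I can post the output
if useful):

root@lama9:~# cman_tool status
root@lama9:~# cman_tool nodes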
The logs on the new node are filled with the following:
Dec 6 15:37:20 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec 6 15:37:20 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec 6 15:37:20 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:20 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec 6 15:37:26 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec 6 15:37:26 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec 6 15:37:26 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:26 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec 6 15:37:32 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
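If I read the corosync headers correctly, error 6 here is
CS_ERR_TRY_AGAIN and error 9 is CS_ERR_BAD_HANDLE, i.e. pmxcfs never gets
a working connection to corosync. The values can be double-checked with
(assuming the corosync development headers are installed):

root@lama9:~# grep -n 'CS_ERR_TRY_AGAIN\|CS_ERR_BAD_HANDLE' /usr/include/corosync/corotypes.h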
On the original node, meanwhile, I got the following:
Dec 6 15:21:54 lama2 corosync[35840]: [QUORUM] Members[1]: 1
Dec 6 15:21:54 lama2 pmxcfs[35761]: [status] notice: update cluster info (cluster name SiwebCluster, version = 2)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.109)
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:21:57 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.109)
Dec 6 15:21:57 lama2 corosync[35840]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 6 15:21:57 lama2 corosync[35840]: [QUORUM] Members[2]: 1 2
Dec 6 15:21:57 lama2 corosync[35840]: [QUORUM] Members[2]: 1 2
Dec 6 15:21:57 lama2 corosync[35840]: [QUORUM] Members[2]: 1 2
Dec 6 15:21:57 lama2 corosync[35840]: [CPG ] chosen downlist: sender r(0) ip(172.16.25.102) ; members(old:1 left:0)
Dec 6 15:21:57 lama2 corosync[35840]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 6 15:22:11 lama2 corosync[35840]: [TOTEM ] A processor failed, forming new configuration.
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.109)
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:22:13 lama2 corosync[35840]: [CMAN ] quorum lost, blocking activity
Dec 6 15:22:13 lama2 corosync[35840]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 6 15:22:13 lama2 pmxcfs[35761]: [status] notice: node lost quorum
Dec 6 15:22:13 lama2 corosync[35840]: [QUORUM] Members[1]: 1
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] CLM CONFIGURATION CHANGE
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] New Configuration:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] #011r(0) ip(172.16.25.102)
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Left:
Dec 6 15:22:13 lama2 corosync[35840]: [CLM ] Members Joined:
Dec 6 15:22:13 lama2 corosync[35840]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 6 15:22:13 lama2 kernel: dlm: closing connection to node 2
Dec 6 15:22:13 lama2 corosync[35840]: [CPG ] chosen downlist: sender r(0) ip(172.16.25.102) ; members(old:2 left:1)
Dec 6 15:22:13 lama2 corosync[35840]: [MAIN ] Completed service synchronization, ready to provide service.
So far I have not been able to get a working cluster.
Is there something wrong with using a separate, isolated network to build
the cluster?
Simone
--
Simone Piccardi Truelite Srl
piccardi at truelite.it (email/jabber) Via Monferrato, 6
Tel. +39-347-1032433 50142 Firenze
http://www.truelite.it Tel. +39-055-7879597 Fax. +39-055-7333336