[pve-devel] need help, lost quorum on all nodes

Alexandre DERUMIER aderumier at odiso.com
Mon Jan 14 18:10:35 CET 2013


Hi,

I lost quorum on my 8-node cluster while trying to upgrade one node to the latest stable release.

Corosync log from when the problem occurred:

Jan 14 17:25:34 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 17:25:34 corosync [CLM   ] New Configuration:
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.38) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.40) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.49) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.50) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.51) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.52) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.53) 
Jan 14 17:25:34 corosync [CLM   ] Members Left:
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.39) 
Jan 14 17:25:34 corosync [CLM   ] Members Joined:
Jan 14 17:25:34 corosync [QUORUM] Members[7]: 1 2 3 4 5 6 8
Jan 14 17:25:34 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 17:25:34 corosync [CLM   ] New Configuration:
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.38) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.40) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.49) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.50) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.51) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.52) 
Jan 14 17:25:34 corosync [CLM   ]       r(0) ip(10.3.94.53) 
Jan 14 17:25:34 corosync [CLM   ] Members Left:
Jan 14 17:25:34 corosync [CLM   ] Members Joined:
Jan 14 17:25:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 14 17:25:35 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.94.53) ; members(old:8 left:1)
Jan 14 17:25:35 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 
Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca 
....
Jan 14 17:29:36 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 17:29:36 corosync [CLM   ] New Configuration:
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.40) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.50) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.51) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.53) 
Jan 14 17:29:36 corosync [CLM   ] Members Left:
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.38) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.49) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.52) 
Jan 14 17:29:36 corosync [CLM   ] Members Joined:
Jan 14 17:29:36 corosync [QUORUM] Members[6]: 1 2 4 5 6 8
Jan 14 17:29:36 corosync [QUORUM] Members[5]: 1 2 4 5 8
Jan 14 17:29:36 corosync [CMAN  ] quorum lost, blocking activity
Jan 14 17:29:36 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 14 17:29:36 corosync [QUORUM] Members[4]: 1 2 4 8
Jan 14 17:29:36 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 17:29:36 corosync [CLM   ] New Configuration:
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.40) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.50) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.51) 
Jan 14 17:29:36 corosync [CLM   ]       r(0) ip(10.3.94.53) 
Jan 14 17:29:36 corosync [CLM   ] Members Left:
Jan 14 17:29:36 corosync [CLM   ] Members Joined:
Jan 14 17:29:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 14 17:29:36 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.94.53) ; members(old:7 left:3)
Jan 14 17:29:36 corosync [MAIN  ] Completed service synchronization, ready to provide service.
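
For reference, the arithmetic behind the "quorum lost" message can be sketched as below. This is only a minimal illustration: it assumes each of the 8 nodes carries one vote and that quorum is a strict majority of the expected votes. The real values come from cluster.conf, which is not shown here, and the name quorum_threshold is made up for this example.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


EXPECTED_VOTES = 8  # assumption: 8 nodes, one vote each

threshold = quorum_threshold(EXPECTED_VOTES)

# Member counts taken from the [QUORUM] Members[...] lines in the log above.
for members in (7, 6, 5, 4):
    state = "quorate" if members >= threshold else "NOT quorate"
    print(f"{members} members: {state} (need {threshold})")
```

Under these assumptions the final membership of 4 nodes falls below the majority threshold, which would be consistent with the node blocking activity.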


But I can't bring the cluster back up.

I tried running the following on each node:

/etc/init.d/cman restart
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster



Corosync log of node1 when restarting cman:

Jan 14 18:04:10 corosync [SERV  ] Unloading all Corosync service engines.
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync configuration service
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: openais cluster membership service B.01.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: openais event service B.01.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: openais distributed locking service B.03.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: openais message service B.03.01
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 14 18:04:10 corosync [SERV  ] Service engine unloaded: openais timer service A.01.01
Jan 14 18:04:10 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1856.
Jan 14 18:04:11 corosync [MAIN  ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
Jan 14 18:04:11 corosync [MAIN  ] Corosync built-in features: nss
Jan 14 18:04:11 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Jan 14 18:04:11 corosync [MAIN  ] Successfully parsed cman config
Jan 14 18:04:11 corosync [MAIN  ] Successfully configured openais services to load
Jan 14 18:04:11 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 14 18:04:11 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 14 18:04:11 corosync [TOTEM ] The network interface [10.3.94.49] is now up.
Jan 14 18:04:11 corosync [QUORUM] Using quorum provider quorum_cman
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jan 14 18:04:11 corosync [CMAN  ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: openais cluster membership service B.01.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: openais event service B.01.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: openais message service B.03.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: openais distributed locking service B.03.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: openais timer service A.01.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync configuration service
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync profile loading service
Jan 14 18:04:11 corosync [QUORUM] Using quorum provider quorum_cman
Jan 14 18:04:11 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jan 14 18:04:11 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jan 14 18:04:11 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 18:04:11 corosync [CLM   ] New Configuration:
Jan 14 18:04:11 corosync [CLM   ] Members Left:
Jan 14 18:04:11 corosync [CLM   ] Members Joined:
Jan 14 18:04:11 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 18:04:11 corosync [CLM   ] New Configuration:
Jan 14 18:04:11 corosync [CLM   ] 	r(0) ip(10.3.94.49) 
Jan 14 18:04:11 corosync [CLM   ] Members Left:
Jan 14 18:04:11 corosync [CLM   ] Members Joined:
Jan 14 18:04:11 corosync [CLM   ] 	r(0) ip(10.3.94.49) 
Jan 14 18:04:11 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 14 18:04:11 corosync [QUORUM] Members[1]: 6
Jan 14 18:04:11 corosync [QUORUM] Members[1]: 6
Jan 14 18:04:11 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.94.49) ; members(old:0 left:0)
Jan 14 18:04:11 corosync [MAIN  ] Completed service synchronization, ready to provide service.


Corosync log of node2 when restarting cman:

Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync configuration service
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: openais cluster membership service B.01.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: openais event service B.01.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: openais distributed locking service B.03.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: openais message service B.03.01
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 14 18:05:30 corosync [SERV  ] Service engine unloaded: openais timer service A.01.01
Jan 14 18:05:30 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1856.
Jan 14 18:05:31 corosync [MAIN  ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
Jan 14 18:05:31 corosync [MAIN  ] Corosync built-in features: nss
Jan 14 18:05:31 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Jan 14 18:05:31 corosync [MAIN  ] Successfully parsed cman config
Jan 14 18:05:31 corosync [MAIN  ] Successfully configured openais services to load
Jan 14 18:05:31 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 14 18:05:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 14 18:05:31 corosync [TOTEM ] The network interface [10.3.94.50] is now up.
Jan 14 18:05:31 corosync [QUORUM] Using quorum provider quorum_cman
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jan 14 18:05:31 corosync [CMAN  ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: openais cluster membership service B.01.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: openais event service B.01.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: openais message service B.03.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: openais distributed locking service B.03.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: openais timer service A.01.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync configuration service
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync profile loading service
Jan 14 18:05:31 corosync [QUORUM] Using quorum provider quorum_cman
Jan 14 18:05:31 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jan 14 18:05:31 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jan 14 18:05:31 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 18:05:31 corosync [CLM   ] New Configuration:
Jan 14 18:05:31 corosync [CLM   ] Members Left:
Jan 14 18:05:31 corosync [CLM   ] Members Joined:
Jan 14 18:05:31 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jan 14 18:05:31 corosync [CLM   ] New Configuration:
Jan 14 18:05:31 corosync [CLM   ]       r(0) ip(10.3.94.50) 
Jan 14 18:05:31 corosync [CLM   ] Members Left:
Jan 14 18:05:31 corosync [CLM   ] Members Joined:
Jan 14 18:05:31 corosync [CLM   ]       r(0) ip(10.3.94.50) 
Jan 14 18:05:31 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 14 18:05:31 corosync [QUORUM] Members[1]: 4
Jan 14 18:05:31 corosync [QUORUM] Members[1]: 4
Jan 14 18:05:31 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.94.50) ; members(old:0 left:0)
Jan 14 18:05:31 corosync [MAIN  ] Completed service synchronization, ready to provide service.
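
To compare what each node sees after a restart, the [QUORUM] Members lines can be pulled out of the corosync log with a short script. This is a rough sketch that assumes the log line format shown above; the function name parse_members and the regular expression are illustrative only, not part of any corosync tooling.

```python
import re

# Matches lines such as: "... corosync [QUORUM] Members[4]: 1 2 4 8"
MEMBERS_RE = re.compile(r"\[QUORUM\] Members\[(\d+)\]:\s*([\d ]*)")


def parse_members(log_text):
    """Return a list of (count, [node ids]) for each [QUORUM] Members line."""
    result = []
    for line in log_text.splitlines():
        match = MEMBERS_RE.search(line)
        if match:
            node_ids = [int(x) for x in match.group(2).split()]
            result.append((int(match.group(1)), node_ids))
    return result


# Sample lines copied from the logs in this message.
sample = """\
Jan 14 17:29:36 corosync [QUORUM] Members[6]: 1 2 4 5 6 8
Jan 14 17:29:36 corosync [QUORUM] Members[5]: 1 2 4 5 8
Jan 14 17:29:36 corosync [QUORUM] Members[4]: 1 2 4 8
Jan 14 18:05:31 corosync [QUORUM] Members[1]: 4
"""

for count, node_ids in parse_members(sample):
    print(count, node_ids)
```

Run against the logs of every node, this makes it easy to spot nodes that only ever see themselves, as in the two restart logs above.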


Any ideas?






