[pve-devel] need help, lost quorum on all nodes

Martin Maurer martin at proxmox.com
Mon Jan 14 21:20:59 CET 2013


Yes, AFAIK this looks like a known issue with non-Proxmox VE kernels, but we already have a new corosync package.
Dietmar will send the details and can explain the cause and the fix.

Martin
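
As a rough aid for reading the quoted log below, here is a minimal sketch of the quorum arithmetic. It assumes each of the 8 nodes carries one vote and that quorum is a simple majority of expected votes; neither is stated in this thread. The helper names quorum_threshold and member_counts are hypothetical, and this is an illustration, not cman's actual implementation.

```python
import re
from typing import List

# Assumption: one vote per node, 8 nodes in total (the thread says "8 nodes").
EXPECTED_VOTES = 8

# Matches membership summaries such as "[QUORUM] Members[5]: 1 2 4 5 8".
MEMBERS_RE = re.compile(r"\[QUORUM\]\s+Members\[(\d+)\]")

# Three entries copied from the quoted log, one per line.
SAMPLE_LOG = """\
Jan 14 17:29:36 corosync [QUORUM] Members[6]: 1 2 4 5 6 8
Jan 14 17:29:36 corosync [QUORUM] Members[5]: 1 2 4 5 8
Jan 14 17:29:36 corosync [QUORUM] Members[4]: 1 2 4 8
"""


def quorum_threshold(expected_votes: int) -> int:
    """Return the smallest vote count that is a strict majority."""
    return expected_votes // 2 + 1


def member_counts(log_text: str) -> List[int]:
    """Extract the member count from every [QUORUM] Members[n] entry."""
    return [int(n) for n in MEMBERS_RE.findall(log_text)]


if __name__ == "__main__":
    threshold = quorum_threshold(EXPECTED_VOTES)
    for count in member_counts(SAMPLE_LOG):
        state = "quorate" if count >= threshold else "not quorate"
        print(f"{count} members (need {threshold}): {state}")
```

Under these assumptions the threshold is 5 votes, so a partition of 4 members is below quorum, and so is the single-member view (Members[1]) that each node reports after the cman restart.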

> -----Original Message-----
> From: pve-devel-bounces at pve.proxmox.com
> [mailto:pve-devel-bounces at pve.proxmox.com] On Behalf Of Alexandre DERUMIER
> Sent: Monday, 14 January 2013 20:47
> To: pve-devel at pve.proxmox.com
> Subject: Re: [pve-devel] need help, lost quorum on all nodes
> 
> Ok, I found the problem.
> 
> I had installed a custom 3.7 kernel on the upgraded node, and it seems to
> cause problems for the corosync cluster (I don't know why; I'll investigate
> tomorrow).
> 
> Maybe it's related to dlm?
> 
> ----- Original Message -----
> 
> From: "Alexandre DERUMIER" <aderumier at odiso.com>
> To: pve-devel at pve.proxmox.com
> Sent: Monday, 14 January 2013 18:10:35
> Subject: [pve-devel] need help, lost quorum on all nodes
> 
> Hi,
> 
> I have lost quorum on my 8-node cluster while trying to upgrade one node
> to the latest stable release.
> 
> Here is the log from when the problem occurred:
> 
> Jan 14 17:25:34 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 17:25:34 corosync [CLM ] New Configuration:
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.38)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.40)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.49)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.50)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.51)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.52)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.53)
> Jan 14 17:25:34 corosync [CLM ] Members Left:
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.39)
> Jan 14 17:25:34 corosync [CLM ] Members Joined:
> Jan 14 17:25:34 corosync [QUORUM] Members[7]: 1 2 3 4 5 6 8
> Jan 14 17:25:34 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 17:25:34 corosync [CLM ] New Configuration:
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.38)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.40)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.49)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.50)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.51)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.52)
> Jan 14 17:25:34 corosync [CLM ] r(0) ip(10.3.94.53)
> Jan 14 17:25:34 corosync [CLM ] Members Left:
> Jan 14 17:25:34 corosync [CLM ] Members Joined:
> Jan 14 17:25:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 14 17:25:35 corosync [CPG ] chosen downlist: sender r(0) ip(10.3.94.53) ; members(old:8 left:1)
> Jan 14 17:25:35 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7c9 7ca 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8
> Jan 14 17:27:32 corosync [TOTEM ] Retransmit List: 7cb 7cc 7cd 7ce 7cf 7d0 7d1 7d2 7d3 7d4 7d5 7d6 7d7 7d8 7c9 7ca
> ....
> Jan 14 17:29:36 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 17:29:36 corosync [CLM ] New Configuration:
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.40)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.50)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.51)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.53)
> Jan 14 17:29:36 corosync [CLM ] Members Left:
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.38)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.49)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.52)
> Jan 14 17:29:36 corosync [CLM ] Members Joined:
> Jan 14 17:29:36 corosync [QUORUM] Members[6]: 1 2 4 5 6 8
> Jan 14 17:29:36 corosync [QUORUM] Members[5]: 1 2 4 5 8
> Jan 14 17:29:36 corosync [CMAN ] quorum lost, blocking activity
> Jan 14 17:29:36 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Jan 14 17:29:36 corosync [QUORUM] Members[4]: 1 2 4 8
> Jan 14 17:29:36 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 17:29:36 corosync [CLM ] New Configuration:
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.40)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.50)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.51)
> Jan 14 17:29:36 corosync [CLM ] r(0) ip(10.3.94.53)
> Jan 14 17:29:36 corosync [CLM ] Members Left:
> Jan 14 17:29:36 corosync [CLM ] Members Joined:
> Jan 14 17:29:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 14 17:29:36 corosync [CPG ] chosen downlist: sender r(0) ip(10.3.94.53) ; members(old:7 left:3)
> Jan 14 17:29:36 corosync [MAIN ] Completed service synchronization, ready to provide service.
> 
> 
> But I can't get the cluster back up.
> 
> I tried running this on each node:
> 
> /etc/init.d/cman restart
> Starting cluster:
> Checking if cluster has been disabled at boot... [ OK ]
> Checking Network Manager... [ OK ]
> Global setup... [ OK ]
> Loading kernel modules... [ OK ]
> Mounting configfs... [ OK ]
> Starting cman... [ OK ]
> Waiting for quorum... Timed-out waiting for cluster
> 
> 
> 
> corosync log of node1 when restarting cman:
> 
> Jan 14 18:04:10 corosync [SERV ] Unloading all Corosync service engines.
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync extended virtual synchrony service
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync configuration service
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync cluster config database access v1.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync profile loading service
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: openais cluster membership service B.01.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: openais checkpoint service B.01.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: openais event service B.01.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: openais distributed locking service B.03.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: openais message service B.03.01
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
> Jan 14 18:04:10 corosync [SERV ] Service engine unloaded: openais timer service A.01.01
> Jan 14 18:04:10 corosync [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1856.
> Jan 14 18:04:11 corosync [MAIN ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
> Jan 14 18:04:11 corosync [MAIN ] Corosync built-in features: nss
> Jan 14 18:04:11 corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
> Jan 14 18:04:11 corosync [MAIN ] Successfully parsed cman config
> Jan 14 18:04:11 corosync [MAIN ] Successfully configured openais services to load
> Jan 14 18:04:11 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
> Jan 14 18:04:11 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Jan 14 18:04:11 corosync [TOTEM ] The network interface [10.3.94.49] is now up.
> Jan 14 18:04:11 corosync [QUORUM] Using quorum provider quorum_cman
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 14 18:04:11 corosync [CMAN ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync CMAN membership service 2.90
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: openais cluster membership service B.01.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: openais event service B.01.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: openais checkpoint service B.01.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: openais message service B.03.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: openais distributed locking service B.03.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: openais timer service A.01.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync configuration service
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync profile loading service
> Jan 14 18:04:11 corosync [QUORUM] Using quorum provider quorum_cman
> Jan 14 18:04:11 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 14 18:04:11 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Jan 14 18:04:11 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 18:04:11 corosync [CLM ] New Configuration:
> Jan 14 18:04:11 corosync [CLM ] Members Left:
> Jan 14 18:04:11 corosync [CLM ] Members Joined:
> Jan 14 18:04:11 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 18:04:11 corosync [CLM ] New Configuration:
> Jan 14 18:04:11 corosync [CLM ] r(0) ip(10.3.94.49)
> Jan 14 18:04:11 corosync [CLM ] Members Left:
> Jan 14 18:04:11 corosync [CLM ] Members Joined:
> Jan 14 18:04:11 corosync [CLM ] r(0) ip(10.3.94.49)
> Jan 14 18:04:11 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 14 18:04:11 corosync [QUORUM] Members[1]: 6
> Jan 14 18:04:11 corosync [QUORUM] Members[1]: 6
> Jan 14 18:04:11 corosync [CPG ] chosen downlist: sender r(0) ip(10.3.94.49) ; members(old:0 left:0)
> Jan 14 18:04:11 corosync [MAIN ] Completed service synchronization, ready to provide service.
> 
> 
> corosync log of node2 when restarting cman:
> 
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync extended virtual synchrony service
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync configuration service
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync cluster config database access v1.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync profile loading service
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: openais cluster membership service B.01.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: openais checkpoint service B.01.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: openais event service B.01.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: openais distributed locking service B.03.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: openais message service B.03.01
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
> Jan 14 18:05:30 corosync [SERV ] Service engine unloaded: openais timer service A.01.01
> Jan 14 18:05:30 corosync [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1856.
> Jan 14 18:05:31 corosync [MAIN ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
> Jan 14 18:05:31 corosync [MAIN ] Corosync built-in features: nss
> Jan 14 18:05:31 corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
> Jan 14 18:05:31 corosync [MAIN ] Successfully parsed cman config
> Jan 14 18:05:31 corosync [MAIN ] Successfully configured openais services to load
> Jan 14 18:05:31 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
> Jan 14 18:05:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Jan 14 18:05:31 corosync [TOTEM ] The network interface [10.3.94.50] is now up.
> Jan 14 18:05:31 corosync [QUORUM] Using quorum provider quorum_cman
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 14 18:05:31 corosync [CMAN ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync CMAN membership service 2.90
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: openais cluster membership service B.01.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: openais event service B.01.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: openais checkpoint service B.01.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: openais message service B.03.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: openais distributed locking service B.03.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: openais timer service A.01.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync configuration service
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync profile loading service
> Jan 14 18:05:31 corosync [QUORUM] Using quorum provider quorum_cman
> Jan 14 18:05:31 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 14 18:05:31 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Jan 14 18:05:31 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 18:05:31 corosync [CLM ] New Configuration:
> Jan 14 18:05:31 corosync [CLM ] Members Left:
> Jan 14 18:05:31 corosync [CLM ] Members Joined:
> Jan 14 18:05:31 corosync [CLM ] CLM CONFIGURATION CHANGE
> Jan 14 18:05:31 corosync [CLM ] New Configuration:
> Jan 14 18:05:31 corosync [CLM ] r(0) ip(10.3.94.50)
> Jan 14 18:05:31 corosync [CLM ] Members Left:
> Jan 14 18:05:31 corosync [CLM ] Members Joined:
> Jan 14 18:05:31 corosync [CLM ] r(0) ip(10.3.94.50)
> Jan 14 18:05:31 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 14 18:05:31 corosync [QUORUM] Members[1]: 4
> Jan 14 18:05:31 corosync [QUORUM] Members[1]: 4
> Jan 14 18:05:31 corosync [CPG ] chosen downlist: sender r(0) ip(10.3.94.50) ; members(old:0 left:0)
> Jan 14 18:05:31 corosync [MAIN ] Completed service synchronization, ready to provide service.
> 
> 
> Any ideas?
> 
> 
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

