[PVE-User] TASK ERROR: cluster not ready - no quorum?

Shain Miley <smiley@npr.org>
Thu Mar 12 15:22:09 CET 2015


Is anyone else having issues with multicast/quorum after upgrading to 3.4?

We have not been able to get our cluster back into a healthy state since 
the upgrade last weekend.

I found this post here:

http://pve.proxmox.com/pipermail/pve-devel/2015-February/014356.html

which suggests there might be an issue with the 2.6.32-37 kernel.

We have downgraded the kernel on 9 of our 19 servers to 2.6.32-34; 
however, those 9 servers still cannot see each other according to 
'pvecm nodes'.
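
For anyone wanting to reproduce the downgrade: on a Wheezy-based PVE 3.x 
node it is roughly a matter of installing the older kernel package and 
selecting it in GRUB (the package name below follows the usual pve-kernel 
naming pattern; check 'apt-cache search pve-kernel' on your node first):

  apt-get install pve-kernel-2.6.32-34-pve
  # point GRUB_DEFAULT in /etc/default/grub at the older entry, then:
  update-grub && reboot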

Using the 2.6.32-37 kernel, it appeared as though the nodes could see 
each other; however, /etc/pve remained in a read-only state, even after 
a quorum was formed (according to 'pvecm status').
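
For anyone following along, a minimal way to check a node's state after 
each reboot (stock PVE 3.x commands, nothing exotic):

  uname -r         # confirm which kernel the node actually booted
  pveversion -v    # installed pve package versions
  pvecm status     # quorum state as cman/corosync see it
  pvecm nodes      # cluster membership as seen from this node
  # /etc/pve only becomes writable once quorum is reached:
  touch /etc/pve/.writetest && rm /etc/pve/.writetest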

At this point we are unsure how to proceed, and we cannot really 
continue to just reboot hosts over and over again.

Does anyone have any suggestions?

Shain

On 03/09/2015 03:04 PM, Shain Miley wrote:
> OK, after some testing it seems like the new 3.4 servers are dropping 
> (or at least not receiving) multicast packets:
>
> Here is a test between two Proxmox 3.4 servers:
>
> root@proxmox3:~# asmping 224.0.2.1 proxmox1.npr.org
> asmping joined (S,G) = (*,224.0.2.234)
> pinging 172.31.2.141 from 172.31.2.33
>   unicast from 172.31.2.141, seq=1 dist=0 time=1.592 ms
>   unicast from 172.31.2.141, seq=2 dist=0 time=0.163 ms
>   unicast from 172.31.2.141, seq=3 dist=0 time=0.136 ms
>   unicast from 172.31.2.141, seq=4 dist=0 time=0.117 ms
> ........
>
> --- 172.31.2.141 statistics ---
> 11 packets transmitted, time 10702 ms
> unicast:
>    11 packets received, 0% packet loss
>    rtt min/avg/max/std-dev = 0.107/0.261/1.592/0.421 ms
> multicast:
>    0 packets received, 100% packet loss
>
>
>
> And here are two other servers (Ubuntu and Debian) connected to the 
> same set of switches as the servers above:
>
> root@test2:~# asmping 224.0.2.1 testserver1.npr.org
> asmping joined (S,G) = (*,224.0.2.234)
> pinging 172.31.2.125 from 172.31.2.131
> multicast from 172.31.2.125, seq=1 dist=0 time=0.203 ms
>   unicast from 172.31.2.125, seq=1 dist=0 time=0.322 ms
>   unicast from 172.31.2.125, seq=2 dist=0 time=0.143 ms
> multicast from 172.31.2.125, seq=2 dist=0 time=0.150 ms
>   unicast from 172.31.2.125, seq=3 dist=0 time=0.138 ms
> multicast from 172.31.2.125, seq=3 dist=0 time=0.146 ms
>   unicast from 172.31.2.125, seq=4 dist=0 time=0.122 ms
> .........
>
> --- 172.31.2.125 statistics ---
> 9 packets transmitted, time 8115 ms
> unicast:
>    9 packets received, 0% packet loss
>    rtt min/avg/max/std-dev = 0.114/0.150/0.322/0.061 ms
> multicast:
>    9 packets received, 0% packet loss since first mc packet (seq 1) recvd
>    rtt min/avg/max/std-dev = 0.118/0.142/0.203/0.026 ms
>
> As you can see, multicast works fine there.
>
>
> All servers are running 2.6.32 kernels, but not all the same version 
> (ranging from 2.6.32-23-pve to 2.6.32-37-pve).
>
> Anyone have any suggestions as to why the Proxmox servers are not 
> seeing the multicast traffic?
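>
> One more thing worth ruling out, in case anyone else hits this: IGMP 
> snooping on the Linux bridge itself, not just on the switches. Assuming 
> the cluster traffic runs over vmbr0 (and that the running kernel exposes 
> the knob), something like this shows and, for testing only, disables 
> snooping on the bridge:
>
> cat /sys/class/net/vmbr0/bridge/multicast_snooping
> echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping  # testing only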
>
> Thanks,
>
> Shain
>
> On 3/9/15 12:33 PM, Shain Miley wrote:
>> I am looking into the possibility that there is a multicast issue 
>> here, as I am unable to ping any of the multicast IP addresses on any 
>> of the nodes.
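>>
>> As a cross-check, omping is another way to exercise this; run it on 
>> every node under test at the same time and it reports unicast and 
>> multicast loss per peer, roughly (hostnames here are just examples 
>> from our cluster):
>>
>> omping -c 10 -i 1 proxmox1 proxmox3 proxmox13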
>>
>> I have reached out to Cisco support for some additional help.
>>
>> I will let you know what I find out.
>>
>> Thanks again,
>>
>> Shain
>>
>>
>> On 3/9/15 11:54 AM, Eneko Lacunza wrote:
>>> It seems something happened yesterday at 20:40:53:
>>>
>>> Mar 08 20:40:53 corosync [TOTEM ] FAILED TO RECEIVE
>>> Mar 08 20:41:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
>>> Mar 08 20:41:05 corosync [CLM   ] New Configuration:
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.48)
>>> Mar 08 20:41:05 corosync [CLM   ] Members Left:
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.16)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.33)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.49)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.50)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.69)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.75)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.77)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.87)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.141)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.142)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.161)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.163)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.165)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.215)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.216)
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.219)
>>> Mar 08 20:41:05 corosync [CLM   ] Members Joined:
>>> Mar 08 20:41:05 corosync [QUORUM] Members[16]: 1 2 4 5 6 7 8 10 11 
>>> 12 13 14 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[15]: 1 2 4 5 6 7 8 11 12 
>>> 13 14 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[14]: 1 2 4 5 6 7 8 11 12 
>>> 14 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[13]: 1 2 4 5 6 7 8 11 12 
>>> 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[12]: 1 2 4 5 6 7 8 11 12 
>>> 15 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[11]: 1 2 4 5 6 7 8 11 12 15 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[10]: 1 2 4 5 6 7 8 11 12 17
>>> Mar 08 20:41:05 corosync [CMAN  ] quorum lost, blocking activity
>>> Mar 08 20:41:05 corosync [QUORUM] This node is within the 
>>> non-primary component and will NOT provide any services.
>>> Mar 08 20:41:05 corosync [QUORUM] Members[9]: 1 2 5 6 7 8 11 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[8]: 1 2 5 6 7 11 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[7]: 1 2 5 6 7 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[6]: 1 2 6 7 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[5]: 1 2 7 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[4]: 1 2 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[3]: 1 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[2]: 1 12
>>> Mar 08 20:41:05 corosync [QUORUM] Members[1]: 12
>>> Mar 08 20:41:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
>>> Mar 08 20:41:05 corosync [CLM   ] New Configuration:
>>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.48)
>>> Mar 08 20:41:05 corosync [CLM   ] Members Left:
>>> Mar 08 20:41:05 corosync [CLM   ] Members Joined:
>>> Mar 08 20:41:05 corosync [TOTEM ] A processor joined or left the 
>>> membership and a new membership was formed.
>>> Mar 08 20:41:05 corosync [CPG   ] chosen downlist: sender r(0) 
>>> ip(172.31.2.48) ; members(old:17 left:16)
>>> Mar 08 20:41:05 corosync [MAIN  ] Completed service synchronization, 
>>> ready to provide service
>>>
>>> Is the "pvecm nodes" output similar on all nodes?
>>>
>>> I don't have experience troubleshooting corosync, but it seems you 
>>> have to re-establish the corosync cluster and quorum.
>>>
>>> Check "corosync-quorumtool -l -i". Also check the cman_tool command 
>>> for diagnosing the cluster.
>>>
>>> Is the corosync service loaded and running? Does restarting it 
>>> change anything (service cman restart)?
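>>>
>>> Concretely, something along these lines (standard cman/corosync 
>>> tooling on PVE 3.x; I have not verified each one against your setup):
>>>
>>> corosync-quorumtool -l -i
>>> cman_tool status
>>> cman_tool nodes
>>> service cman restart && service pve-cluster restart
>>>
>>> (pve-cluster is the service that runs pmxcfs, which mounts /etc/pve.)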
>>>
>>>
>>>
>>> On 09/03/15 16:13, Shain Miley wrote:
>>>> Oddly enough, there is nothing in the latest corosync logfile; 
>>>> however, the one from last night (when we started seeing the 
>>>> problem) has a lot of info in it.
>>>>
>>>> Here is the link to entire file:
>>>>
>>>> http://717b5bb5f6a032ce28eb-fa7f03050c118691fd4b41bf00a93863.r71.cf1.rackcdn.com/corosync.log.1
>>>>
>>>> Thanks again for your help so far.
>>>>
>>>> Shain
>>>>
>>>> On 3/9/15 10:53 AM, Eneko Lacunza wrote:
>>>>> What about /var/log/cluster/corosync.log ?
>>>>>
>>>>> On 09/03/15 15:34, Shain Miley wrote:
>>>>>> Yes,
>>>>>>
>>>>>> All the nodes are pingable and resolvable via their hostnames.
>>>>>>
>>>>>> Here is the output of 'pvecm nodes':
>>>>>>
>>>>>>
>>>>>> root@proxmox13:~# pvecm nodes
>>>>>> Node  Sts   Inc   Joined               Name
>>>>>>    1   X    964                        proxmox22
>>>>>>    2   X    964                        proxmox23
>>>>>>    3   X    756                        proxmox24
>>>>>>    4   X    808                        proxmox18
>>>>>>    5   X    964                        proxmox19
>>>>>>    6   X    964                        proxmox20
>>>>>>    7   X    964                        proxmox21
>>>>>>    8   X    964                        proxmox1
>>>>>>    9   X      0                        proxmox2
>>>>>>   10   X    756                        proxmox3
>>>>>>   11   X    964                        proxmox4
>>>>>>   12   M    696   2014-10-20 01:10:09  proxmox13
>>>>>>   13   X    904                        proxmox14
>>>>>>   14   X    848                        proxmox15
>>>>>>   15   X    856                        proxmox16
>>>>>>   16   X    836                        proxmox17
>>>>>>   17   X    964                        proxmox25
>>>>>>   18   X    960                        proxmox26
>>>>>>   19   X    868                        proxmox28
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Shain
>>>>>>
>>>>>> On 3/9/15 10:23 AM, Eneko Lacunza wrote:
>>>>>>> pvecm nodes
>>>
>>>
>>> -- 
>>> Zuzendari Teknikoa / Director Técnico
>>> Binovo IT Human Project, S.L.
>>> Telf. 943575997
>>>        943493611
>>> Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
>>> www.binovo.es
>
>
> -- 
> NPR | Shain Miley | Manager of Systems and Infrastructure, Digital 
> Media | smiley@npr.org | p: 202-513-3649
