[PVE-User] TASK ERROR: cluster not ready - no quorum?
Shain Miley
smiley at npr.org
Thu Mar 12 15:22:09 CET 2015
Is anyone else having issues with multicast/quorum after upgrading to 3.4?
We have not been able to get our cluster back into a healthy state since
the upgrade last weekend.
I found this post here:
http://pve.proxmox.com/pipermail/pve-devel/2015-February/014356.html
which suggests there might be an issue with the 2.6.32-37 kernel.
We have downgraded the kernel on 9 of our 19 servers to 2.6.32-34;
however, those 9 servers still cannot see each other according to
'pvecm nodes'.
Using the 2.6.32-37 kernel, it appeared as though the nodes could see
each other; however, /etc/pve remained in a read-only state, even
after a quorum was formed (according to pvecm status).
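A rough sketch of the kind of per-node check and restart sequence in question here, assuming the stock PVE 3.x cman/pve-cluster tooling (the prompt hostname is just an example, and 'pvecm expected 1' is only a last-resort, single-node override to make /etc/pve writable again):

root@proxmox1:~# pveversion -v | head -n 3    # confirm kernel/package versions after the downgrade
root@proxmox1:~# pvecm status                 # membership state, expected/total votes, quorum
root@proxmox1:~# pvecm nodes                  # which nodes this host can currently see
root@proxmox1:~# service cman restart && service pve-cluster restart
root@proxmox1:~# pvecm expected 1             # last resort on a single node to regain a writable /etc/pve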
At this point we are unsure how to proceed, and we cannot keep
rebooting hosts over and over again.
Does anyone have any suggestions?
Shain
On 03/09/2015 03:04 PM, Shain Miley wrote:
> OK, after some testing it seems like the new 3.4 servers are dropping
> (or at least not receiving) multicast packets:
>
> Here is a test between two 3.4 proxmox servers:
>
> root@proxmox3:~# asmping 224.0.2.1 proxmox1.npr.org
> asmping joined (S,G) = (*,224.0.2.234)
> pinging 172.31.2.141 from 172.31.2.33
> unicast from 172.31.2.141, seq=1 dist=0 time=1.592 ms
> unicast from 172.31.2.141, seq=2 dist=0 time=0.163 ms
> unicast from 172.31.2.141, seq=3 dist=0 time=0.136 ms
> unicast from 172.31.2.141, seq=4 dist=0 time=0.117 ms
> ........
>
> --- 172.31.2.141 statistics ---
> 11 packets transmitted, time 10702 ms
> unicast:
> 11 packets received, 0% packet loss
> rtt min/avg/max/std-dev = 0.107/0.261/1.592/0.421 ms
> multicast:
> 0 packets received, 100% packet loss
>
>
>
> And here are two other servers (Ubuntu and Debian) connected to the
> same set of switches as the servers above:
>
> root@test2:~# asmping 224.0.2.1 testserver1.npr.org
> asmping joined (S,G) = (*,224.0.2.234)
> pinging 172.31.2.125 from 172.31.2.131
> multicast from 172.31.2.125, seq=1 dist=0 time=0.203 ms
> unicast from 172.31.2.125, seq=1 dist=0 time=0.322 ms
> unicast from 172.31.2.125, seq=2 dist=0 time=0.143 ms
> multicast from 172.31.2.125, seq=2 dist=0 time=0.150 ms
> unicast from 172.31.2.125, seq=3 dist=0 time=0.138 ms
> multicast from 172.31.2.125, seq=3 dist=0 time=0.146 ms
> unicast from 172.31.2.125, seq=4 dist=0 time=0.122 ms
> .........
>
> --- 172.31.2.125 statistics ---
> 9 packets transmitted, time 8115 ms
> unicast:
> 9 packets received, 0% packet loss
> rtt min/avg/max/std-dev = 0.114/0.150/0.322/0.061 ms
> multicast:
> 9 packets received, 0% packet loss since first mc packet (seq 1) recvd
> rtt min/avg/max/std-dev = 0.118/0.142/0.203/0.026 ms
>
> As you can see multicast works fine there.
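>
> A couple of ways to cross-check this outside of asmping (just a sketch; the bridge name vmbr0 and the omping flags are assumptions, and 224.0.2.234 is simply the group asmping joined above):
>
> root@proxmox3:~# tcpdump -ni vmbr0 igmp                 # are IGMP joins/queries visible on the bridge?
> root@proxmox3:~# tcpdump -ni vmbr0 host 224.0.2.234     # does the test group's traffic arrive at all?
> root@proxmox3:~# omping -c 60 -i 1 proxmox1 proxmox3    # run on both nodes; compares unicast vs multicast loss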
>
>
> All servers are running 2.6.32 kernels, but not all the same version
> (ranging from 2.6.32-23-pve to 2.6.32-37-pve).
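>
> One more thing that may be worth ruling out (again just a sketch; the bridge name vmbr0 is an assumption, and not all of these knobs exist on older 2.6.32 kernels): IGMP snooping on the switch or on the Linux bridge with no active querier can silently stop multicast from being forwarded.
>
> root@proxmox3:~# cat /sys/class/net/vmbr0/bridge/multicast_snooping      # 1 = bridge-level IGMP snooping enabled
> root@proxmox3:~# cat /sys/class/net/vmbr0/bridge/multicast_querier       # 0 = the bridge is not acting as a querier
> root@proxmox3:~# echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier  # test only: let the bridge send IGMP queries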
>
> Anyone have any suggestions as to why the Proxmox servers are not
> seeing the multicast traffic?
>
> Thanks,
>
> Shain
>
> On 3/9/15 12:33 PM, Shain Miley wrote:
>> I am looking into the possibility that there is a multicast issue
>> here, as I am unable to ping any of the multicast IP addresses on
>> any of the nodes.
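>>
>> For what it's worth, a sketch of how to see which multicast address the cluster is actually using and to get the nodes answering on it (corosync-objctl is the corosync 1.x tool shipped with PVE 3.x, though whether it reports mcastaddr this way is an assumption; the sysctl matters because multicast/broadcast echo replies are ignored by default):
>>
>> root@proxmox13:~# corosync-objctl | grep -i mcast                    # multicast address/port from the running config
>> root@proxmox13:~# sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0  # on every node, so it replies to multicast ping
>> root@proxmox13:~# ping -c 3 239.192.x.x                             # substitute the address reported above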
>>
>> I have reached out to Cisco support for some additional help.
>>
>> I will let you know what I find out.
>>
>> Thanks again,
>>
>> Shain
>>
>>
>> On 3/9/15 11:54 AM, Eneko Lacunza wrote:
>>> It seems something happened yesterday at 20:40:53:
>>>
>>> Mar 08 20:40:53 corosync [TOTEM ] FAILED TO RECEIVE
>>> Mar 08 20:41:05 corosync [CLM ] CLM CONFIGURATION CHANGE
>>> Mar 08 20:41:05 corosync [CLM ] New Configuration:
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.48)
>>> Mar 08 20:41:05 corosync [CLM ] Members Left:
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.16)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.33)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.49)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.50)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.69)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.75)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.77)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.87)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.141)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.142)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.161)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.163)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.165)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.215)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.216)
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.219)
>>> Mar 08 20:41:05 corosync [CLM ] Members Joined:
>>> Mar 08 20:41:05 corosync [QUORUM] Members[16]: 1 2 4 5 6 7 8 10 11
>>> 12 13 14 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[15]: 1 2 4 5 6 7 8 11 12
>>> 13 14 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[14]: 1 2 4 5 6 7 8 11 12
>>> 14 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[13]: 1 2 4 5 6 7 8 11 12
>>> 15 16 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[12]: 1 2 4 5 6 7 8 11 12
>>> 15 17 19
>>> Mar 08 20:41:05 corosync [QUORUM] Members[11]: 1 2 4 5 6 7 8 11 12 15 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[10]: 1 2 4 5 6 7 8 11 12 17
>>> Mar 08 20:41:05 corosync [CMAN ] quorum lost, blocking activity
>>> Mar 08 20:41:05 corosync [QUORUM] This node is within the
>>> non-primary component and will NOT provide any services.
>>> Mar 08 20:41:05 corosync [QUORUM] Members[9]: 1 2 5 6 7 8 11 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[8]: 1 2 5 6 7 11 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[7]: 1 2 5 6 7 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[6]: 1 2 6 7 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[5]: 1 2 7 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[4]: 1 2 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[3]: 1 12 17
>>> Mar 08 20:41:05 corosync [QUORUM] Members[2]: 1 12
>>> Mar 08 20:41:05 corosync [QUORUM] Members[1]: 12
>>> Mar 08 20:41:05 corosync [CLM ] CLM CONFIGURATION CHANGE
>>> Mar 08 20:41:05 corosync [CLM ] New Configuration:
>>> Mar 08 20:41:05 corosync [CLM ] r(0) ip(172.31.2.48)
>>> Mar 08 20:41:05 corosync [CLM ] Members Left:
>>> Mar 08 20:41:05 corosync [CLM ] Members Joined:
>>> Mar 08 20:41:05 corosync [TOTEM ] A processor joined or left the
>>> membership and a new membership was formed.
>>> Mar 08 20:41:05 corosync [CPG ] chosen downlist: sender r(0)
>>> ip(172.31.2.48) ; members(old:17 left:16)
>>> Mar 08 20:41:05 corosync [MAIN ] Completed service synchronization,
>>> ready to provide service
>>>
>>> Is the "pvecm nodes" similar in all nodes?
>>>
>>> I don't have experience troubleshooting corosync, but it seems you
>>> have to re-establish the corosync cluster and quorum.
>>>
>>> Check "corosync-quorumtool -l -i" . Also check cman_tool command for
>>> diagnosing the cluster.
>>>
>>> Is the corosync service loaded and running? Does restarting it
>>> change anything (service cman restart)?
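>>>
>>> Concretely, something along these lines on a couple of nodes (just a sketch; the exact output fields vary by version):
>>>
>>> corosync-quorumtool -l -i     # node list and this node's id as corosync sees it
>>> cman_tool status              # membership state, expected votes, quorum
>>> cman_tool nodes               # per-node join status
>>> service cman status
>>> service cman restart          # does a restart change anything?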
>>>
>>>
>>>
>>> On 09/03/15 16:13, Shain Miley wrote:
>>>> Oddly enough, there is nothing in the latest corosync logfile;
>>>> however, the one from last night (when we started seeing the
>>>> problem) has a lot of info in it.
>>>>
>>>> Here is the link to entire file:
>>>>
>>>> http://717b5bb5f6a032ce28eb-fa7f03050c118691fd4b41bf00a93863.r71.cf1.rackcdn.com/corosync.log.1
>>>>
>>>> Thanks again for your help so far.
>>>>
>>>> Shain
>>>>
>>>> On 3/9/15 10:53 AM, Eneko Lacunza wrote:
>>>>> What about /var/log/cluster/corosync.log?
>>>>>
>>>>> On 09/03/15 15:34, Shain Miley wrote:
>>>>>> Yes,
>>>>>>
>>>>>> All the nodes are pingable and resolvable via their hostname.
>>>>>>
>>>>>> Here is the output of 'pvecm nodes':
>>>>>>
>>>>>>
>>>>>> root@proxmox13:~# pvecm nodes
>>>>>> Node Sts Inc Joined Name
>>>>>> 1 X 964 proxmox22
>>>>>> 2 X 964 proxmox23
>>>>>> 3 X 756 proxmox24
>>>>>> 4 X 808 proxmox18
>>>>>> 5 X 964 proxmox19
>>>>>> 6 X 964 proxmox20
>>>>>> 7 X 964 proxmox21
>>>>>> 8 X 964 proxmox1
>>>>>> 9 X 0 proxmox2
>>>>>> 10 X 756 proxmox3
>>>>>> 11 X 964 proxmox4
>>>>>> 12 M 696 2014-10-20 01:10:09 proxmox13
>>>>>> 13 X 904 proxmox14
>>>>>> 14 X 848 proxmox15
>>>>>> 15 X 856 proxmox16
>>>>>> 16 X 836 proxmox17
>>>>>> 17 X 964 proxmox25
>>>>>> 18 X 960 proxmox26
>>>>>> 19 X 868 proxmox28
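>>>>>>
>>>>>> To compare that view from every host at once, a quick loop like this works (a sketch; it assumes passwordless root ssh between the nodes and that the short hostnames resolve):
>>>>>>
>>>>>> for n in proxmox1 proxmox3 proxmox13 proxmox22; do
>>>>>>   echo "== $n =="
>>>>>>   ssh root@$n "pvecm status | grep -E 'Membership state|Expected votes|Total votes|Quorum'"
>>>>>> done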
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Shain
>>>>>>
>>>>>> On 3/9/15 10:23 AM, Eneko Lacunza wrote:
>>>>>>> pvecm nodes
>>>>>>
>>>>>>
>>>>>> --
>>>>>> _NPR | Shain Miley| Manager of Systems and Infrastructure,
>>>>>> Digital Media | smiley at npr.org | p: 202-513-3649
>>>>>
>>>>>
>>>>> --
>>>>> Zuzendari Teknikoa / Director Técnico
>>>>> Binovo IT Human Project, S.L.
>>>>> Telf. 943575997
>>>>> 943493611
>>>>> Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
>>>>> www.binovo.es
>>>>
>>>>
>>>> --
>>>> _NPR | Shain Miley| Manager of Systems and Infrastructure, Digital
>>>> Media | smiley at npr.org | p: 202-513-3649
>>>
>>>
>>> --
>>> Zuzendari Teknikoa / Director Técnico
>>> Binovo IT Human Project, S.L.
>>> Telf. 943575997
>>> 943493611
>>> Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
>>> www.binovo.es
>>
>>
>> --
>> _NPR | Shain Miley| Manager of Systems and Infrastructure, Digital
>> Media | smiley at npr.org | p: 202-513-3649
>>
>>
>
>
> --
> _NPR | Shain Miley| Manager of Systems and Infrastructure, Digital
> Media | smiley at npr.org | p: 202-513-3649
>
>