[PVE-User] Cluster doesn't recover automatically after blackout

Eneko Lacunza elacunza at binovo.es
Wed Aug 1 13:40:34 CEST 2018


Hi Alwin,

On 01/08/18 at 12:56, Alwin Antreich wrote:
> On Wed, Aug 01, 2018 at 11:02:18AM +0200, Eneko Lacunza wrote:
>> Hi all,
>>
>> This morning there was a quite long blackout which powered off a cluster of
>> 3 proxmox 5.1 servers.
>>
>> All 3 servers are the same make and model, so they need the same amount of time
>> to boot.
>>
>> When the power came back, servers started correctly but corosync couldn't
>> set up a quorum. Events timing:
> I recommend against servers returning automatically to their previous power
> state after a power loss. A manual start up is better, as by then the
> admin has made sure power is back to normal operation. This will also reduce
> the chance of breakage if there are subsequent power or hardware
> failures.
This is an off-site location with no knowledgeable sysadmins, and the servers
don't have remote management cards. I'm sure they would screw up the boot :)

I'm afraid we have to take the risk. :)
>
>> 07:57:10 corosync start
>> 07:57:15 first pmxcfs error quorum_initialize_failed: 2
>> 07:57:52 network up
>> 07:58:40 Corosync timeout
>> 07:59:57 time sync works
>>
>> What I can see is that the network switch booted more slowly than the servers, but
>> nonetheless the network was operational about 45s before corosync gave up
>> trying to set up a quorum.
>>
>> I also can see that internet access wasn't back until 1 minute after
>> corosync timeout (the time sync event).
>>
>> A simple restart of pve-cluster at about 9:50 restored the cluster to normal
>> state.
>>
>> Is this expected? I expected that corosync would set up a quorum after
>> network was operational....
> When was multicast working again? That might have taken longer, as IGMP
> snooping and the querier on the switch might just take longer to get
> operating again.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout
to test things again with technicians onsite, so we could get more info
that day.
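
For reference, the intervals between the events quoted above can be worked out with a few lines of Python. This is only a sketch: the timestamps are copied from the event list in this thread, and the helper name seconds_between is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the event list quoted in this thread
# (all on the same morning, local time).
EVENTS = {
    "corosync start": "07:57:10",
    "first pmxcfs error": "07:57:15",
    "network up": "07:57:52",
    "corosync timeout": "07:58:40",
    "time sync works": "07:59:57",
}


def seconds_between(first: str, second: str) -> int:
    """Return the number of seconds from event `first` to event `second`."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(EVENTS[first], fmt)
    end = datetime.strptime(EVENTS[second], fmt)
    return int((end - start).total_seconds())


# How long the network had been up when corosync gave up.
print(seconds_between("network up", "corosync timeout"))
# How long after the corosync timeout internet access (time sync) came back.
print(seconds_between("corosync timeout", "time sync works"))
# Total time corosync spent trying before the timeout.
print(seconds_between("corosync start", "corosync timeout"))
```

With these numbers the network was up for 48 seconds before the timeout, which matches the "about 45s" mentioned above.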

The switch is an HPE 1820-24G (J9980A); it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
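
Until the cause is understood, the manual pve-cluster restart that fixed things could in principle be automated at boot. The following is a rough sketch only, not a tested fix: the peer addresses are hypothetical placeholders, the function names are made up, and a successful ping only proves unicast reachability, not that multicast is working on the switch.

```python
import subprocess
import time

# Hypothetical addresses of the other two cluster nodes (placeholders,
# not taken from the real cluster).
PEERS = ["192.0.2.11", "192.0.2.12"]


def peer_reachable(address: str) -> bool:
    """Return True if a single ICMP echo request is answered (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_peers(peers, check=peer_reachable, attempts=60, delay=5.0,
                   sleep=time.sleep) -> bool:
    """Poll until every peer answers, or give up after `attempts` rounds."""
    for _ in range(attempts):
        if all(check(peer) for peer in peers):
            return True
        sleep(delay)
    return False


def restart_pve_cluster() -> None:
    """Restart the service whose manual restart restored the cluster."""
    subprocess.run(["systemctl", "restart", "pve-cluster"], check=True)


def main() -> None:
    # Only restart once the other nodes answer; otherwise leave things alone.
    if wait_for_peers(PEERS):
        restart_pve_cluster()
```

Calling main() once after boot (as root, on each node) would mimic what was done by hand at about 9:50. Whether that is safe to run unattended is an open question.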

Thanks a lot,
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es


