[PVE-User] Cluster doesn't recover automatically after blackout
Eneko Lacunza
elacunza at binovo.es
Wed Aug 1 16:12:19 CEST 2018
Hi,
On 01/08/18 at 13:57, Alwin Antreich wrote:
> On Wed, Aug 01, 2018 at 01:40:34PM +0200, Eneko Lacunza wrote:
>> On 01/08/18 at 12:56, Alwin Antreich wrote:
>>> On Wed, Aug 01, 2018 at 11:02:18AM +0200, Eneko Lacunza wrote:
>>>> Hi all,
>>>>
>>>> This morning there was a quite long blackout which powered off a cluster of
>>>> 3 proxmox 5.1 servers.
>>>>
>>>> All 3 servers are the same make and model, so they need the same amount of
>>>> time to boot.
>>>>
>>>> When the power came back, servers started correctly but corosync couldn't
>>>> set up a quorum. Events timing:
>>> I recommend against servers returning automatically to their previous power
>>> state after a power loss. A manual start up is better, as by then the
>>> admin made sure power is back to normal operation. This will also reduce
>>> the chance of breakage if there are subsequent power or hardware
>>> failures.
>> This is an off-site place with no knowledgeable sysadmins and servers don't
>> have remote control cards. I'm sure they would screw the boot up :)
>>
>> I'm afraid we have to take the risk. :)
> A boot delay, if the servers have such a setting, or switchable UPS power
> plugs might help. :)
Yes, I can do that at grub level, that's no problem. But first I have to
know the correct amount for the delay ;)
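As a rough starting point for that delay, the gaps between the event timings quoted further down can be worked out with a few lines of shell. This is only a sketch: the function and variable names are made up for illustration, and it says nothing about when multicast became usable, which is still unknown.

```shell
#!/bin/sh
# Convert a same-day HH:MM:SS timestamp to seconds since midnight.
# awk converts "07" or "08" to decimal numbers, avoiding the octal
# pitfall of shell arithmetic with leading zeros.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds elapsed from timestamp $1 to timestamp $2 (same day assumed).
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Event timings taken from the timeline in this thread.
corosync_start=07:57:10
network_up=07:57:52
corosync_timeout=07:58:40

echo "network up after corosync start: $(elapsed "$corosync_start" "$network_up")s"
echo "network up before corosync timeout: $(elapsed "$network_up" "$corosync_timeout")s"
```

With these timings the network came up 42s after corosync started and 48s before the start timeout, so a boot delay would have to cover at least those 42s, plus whatever the switch needs for multicast.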
>
>>>
>>>> 07:57:10 corosync start
>>>> 07:57:15 first pmxcfs error quorum_initialize_failed: 2
>>>> 07:57:52 network up
>>>> 07:58:40 Corosync timeout
>>>> 07:59:57 time sync works
>>>>
>>>> What I can see is that the network switch booted more slowly than the
>>>> servers, but nonetheless the network was operational about 45s before
>>>> corosync gave up trying to set up a quorum.
>>>>
>>>> I can also see that internet access wasn't back until 1 minute after
>>>> the corosync timeout (the time sync event).
>>>>
>>>> A simple restart of pve-cluster at about 9:50 restored the cluster to normal
>>>> state.
>>>>
>>>> Is this expected? I expected that corosync would set up a quorum after
>>>> network was operational....
>>> When was multicast working again? That might have taken longer, as IGMP
>>> snooping and the querier on the switch might just take longer to get
>>> operating again.
>> I don't have that info (or I don't know how to look that up in the logs;
>> /var/log/corosync is empty). I'm trying to plan an intentional blackout to
>> test things again with technicians onsite; we could get more info that day.
> Corosync writes into the syslog, there should be more to find.
There doesn't seem to be any more, as far as I can see:
# grep corosync /var/log/syslog
Aug 1 07:57:11 proxmox1 corosync[1697]: [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug 1 07:57:11 proxmox1 corosync[1697]: notice [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug 1 07:57:11 proxmox1 corosync[1697]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug 1 07:57:11 proxmox1 corosync[1697]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug 1 07:58:40 proxmox1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Aug 1 07:58:40 proxmox1 systemd[1]: corosync.service: Unit entered failed state.
Aug 1 07:58:40 proxmox1 systemd[1]: corosync.service: Failed with result 'timeout'.
Aug 1 09:51:35 proxmox1 corosync[32220]: [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
This last line is our manual pve-cluster restart.
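The grep above only matches lines containing "corosync", so it leaves out the pmxcfs error from the timeline. A slightly wider filter might show more context; this is a sketch that assumes pmxcfs and pve-cluster messages end up in the same syslog file, and the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print cluster-related syslog lines, not only
# those mentioning corosync. Assumes pmxcfs / pve-cluster messages are
# written to the same file. Defaults to /var/log/syslog when no file
# is given.
cluster_log_lines() {
    grep -E 'corosync|pmxcfs|pve-cluster' "${1:-/var/log/syslog}"
}

# Usage:
#   cluster_log_lines                     # current syslog
#   cluster_log_lines /var/log/syslog.1   # previous, rotated syslog
```

Taking the file as an argument keeps the same filter usable on rotated logs if the blackout test happens on a different day than the log review.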
>
>> Switch is HPE 1820-24G J9980A, it's L2 but quite dumb; we have several 18x0
>> switches deployed with good results so far.
> The switch may hold a log that shows its startup process.
It seems logging was disabled on the switch; we have enabled it now.
Thanks
Eneko
--
Technical Director
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es