[PVE-User] Cluster doesn't recover automatically after blackout
Alwin Antreich
a.antreich at proxmox.com
Wed Aug 1 12:56:05 CEST 2018
Hi,
On Wed, Aug 01, 2018 at 11:02:18AM +0200, Eneko Lacunza wrote:
> Hi all,
>
> This morning there was quite a long blackout which powered off a cluster of
> 3 Proxmox 5.1 servers.
>
> All 3 servers are the same make and model, so they need the same amount of
> time to boot.
>
> When the power came back, the servers started correctly but corosync couldn't
> set up a quorum. Events timing:
I recommend against letting servers automatically return to their previous
power state after a power loss. A manual start-up is better, because by then
the admin has made sure that power is back to normal operation. This also
reduces the chance of breakage if there are subsequent power or hardware
failures.
>
> 07:57:10 corosync start
> 07:57:15 first pmxcfs error quorum_initialize_failed: 2
> 07:57:52 network up
> 07:58:40 Corosync timeout
> 07:59:57 time sync works
>
> What I can see is that the network switch booted more slowly than the
> servers, but nonetheless the network was operational about 45s before
> corosync gave up trying to set up a quorum.
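The intervals between the logged events can be checked with a quick calculation. This is a minimal Python sketch, assuming all events fall on the same day; the event names are shortened from the log excerpt and the helper names are hypothetical.

```python
from datetime import datetime

# Event times copied from the log excerpt (same day, local time assumed).
EVENTS = [
    ("corosync start", "07:57:10"),
    ("first pmxcfs error", "07:57:15"),
    ("network up", "07:57:52"),
    ("corosync timeout", "07:58:40"),
    ("time sync works", "07:59:57"),
]


def parse(hhmmss: str) -> datetime:
    """Parse an HH:MM:SS string; the date part is irrelevant for differences."""
    return datetime.strptime(hhmmss, "%H:%M:%S")


def seconds_between(events, start: str, end: str) -> int:
    """Return the number of seconds from event `start` to event `end`."""
    times = {name: parse(ts) for name, ts in events}
    return int((times[end] - times[start]).total_seconds())


if __name__ == "__main__":
    pairs = [
        ("corosync start", "network up"),        # 42 s
        ("network up", "corosync timeout"),      # 48 s
        ("corosync start", "corosync timeout"),  # 90 s
        ("corosync timeout", "time sync works"), # 77 s
    ]
    for a, b in pairs:
        print(f"{a} -> {b}: {seconds_between(EVENTS, a, b)} s")
```

By these timestamps the network had been up for roughly 48 s when corosync gave up, about 90 s after corosync started.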
>
> I also can see that internet access wasn't back until 1 minute after
> corosync timeout (the time sync event).
>
> A simple restart of pve-cluster at about 9:50 restored the cluster to a
> normal state.
>
> Is this expected? I expected that corosync would set up a quorum after the
> network was operational....
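For context on what "setting up a quorum" means here: with one vote per node and no special options such as `two_node`, corosync's votequorum needs a simple majority of the expected votes, so a 3-node cluster needs 2 nodes that can reach each other. A minimal Python sketch of that arithmetic (function names are hypothetical):

```python
def majority_quorum(expected_votes: int) -> int:
    """Votes needed for a simple-majority quorum: floor(n / 2) + 1."""
    if expected_votes < 1:
        raise ValueError("expected_votes must be positive")
    return expected_votes // 2 + 1


def is_quorate(votes_seen: int, expected_votes: int) -> bool:
    """True if the votes a node can currently see reach the majority threshold."""
    return votes_seen >= majority_quorum(expected_votes)


if __name__ == "__main__":
    # A node that cannot reach any peer sees only its own vote.
    print(majority_quorum(3), is_quorate(1, 3), is_quorate(2, 3))
```

This is why a node that boots before its peers are reachable over the network stays non-quorate on its own.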
When was multicast working again? That might have taken longer, as IGMP
snooping and the multicast querier on the switch may need extra time to
become operational after a reboot.
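To find out when multicast actually works again between the nodes, a dedicated tool such as `omping` is the usual choice. As a rough illustration of what such a check does, here is a minimal Python sketch that joins a multicast group and collects datagrams, plus a sender. The group `239.192.0.1` and port `5000` are hypothetical placeholders; pick values that do not clash with the cluster's own traffic.

```python
import ipaddress
import socket
import struct


def is_multicast(addr: str) -> bool:
    """True if `addr` is an IPv4/IPv6 multicast address."""
    return ipaddress.ip_address(addr).is_multicast


def membership_request(group: str, iface_ip: str = "0.0.0.0") -> bytes:
    """Build the `struct ip_mreq` bytes used with IP_ADD_MEMBERSHIP."""
    if not is_multicast(group):
        raise ValueError(f"{group} is not a multicast address")
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))


def listen(group: str, port: int, timeout: float = 5.0) -> list:
    """Join `group` on any interface; collect datagrams until none arrive for `timeout` s."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    sock.settimeout(timeout)
    received = []
    try:
        while True:
            data, sender = sock.recvfrom(1024)
            received.append((sender[0], data))
    except socket.timeout:
        pass  # quiet period reached; stop collecting
    finally:
        sock.close()
    return received


def send(group: str, port: int, message: bytes, ttl: int = 1) -> None:
    """Send one datagram to `group`; TTL 1 keeps it on the local segment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    try:
        sock.sendto(message, (group, port))
    finally:
        sock.close()
```

Usage idea: run `listen("239.192.0.1", 5000)` on two nodes and `send("239.192.0.1", 5000, b"ping")` from the third, repeating while the switch comes back up. If datagrams only start arriving well after unicast connectivity returns, IGMP snooping or querier start-up delay on the switch is a plausible explanation.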
--
Cheers,
Alwin