[PVE-User] Expected fencing behavior on a bifurcated 4-node HA cluster
Thomas Lamprecht
t.lamprecht at proxmox.com
Wed May 3 09:41:54 CEST 2017
Hi,
On 05/02/2017 05:40 PM, Adam Carheden wrote:
> What's supposed to happen if two nodes in a 4-node HA cluster go offline?
If all of them have HA services configured, then a full cluster reset may happen.
If two nodes go offline, the whole cluster loses quorum, so all nodes
with an active watchdog (i.e. all nodes which have, or recently had,
active HA services) will reset.
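To put numbers on it for a 4-node cluster:

  expected votes: 4
  quorum:         floor(4/2) + 1 = 3
  after a 2/2 split: each partition holds only 2 votes, which is < 3

so neither partition is quorate, and every node with an armed watchdog
fences itself.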
For such a situation, where there is a tie, an external voting arbitrator
would help; this could be a fifth (tiny) node or a corosync QDevice.
QDevices have the advantage that they can run on any newer Linux distro
which ships corosync (2.4 and newer, AFAIK), independent of the PVE stack.
They can provide arbitrator votes to multiple clusters, and have fewer
constraints regarding network latency, as the communication happens
over TCP.
This is usable from PVE, but we haven't documented it yet; I started to
do that and need to pick it up again soon.
Just a note for any other readers: while this can boost reliability and
recovery in clusters with an even vote count (you can only 'win' there),
it can do the reverse in clusters with an uneven node count.
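For anyone who wants to try the QDevice route before the documentation
lands, the rough shape (a sketch only, not official PVE tooling yet; the
host address is an example and the certificate setup is left out) is: run
corosync-qnetd on the external arbitrator host, install corosync-qdevice
on all cluster nodes, and add a device section to the quorum block of
/etc/corosync/corosync.conf, roughly:

  quorum {
    provider: corosync_votequorum
    device {
      votes: 1
      model: net
      net {
        host: 192.168.0.250   # example address of the external qnetd arbitrator
        tls: on               # needs certs, see corosync-qdevice-net-certutil
        algorithm: ffsplit    # fifty-fifty split, meant for even node counts
      }
    }
  }

then restart corosync and start the corosync-qdevice service on the nodes.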
>
> I have a 4-node test cluster, two nodes are in one server room and the
> other two in another server room. I had HA inadvertently tested for me
> this morning due to an unexpected network issue and watchdog rebooted
> two of the nodes.
>
> I think this is the expected behavior, and certainly seems like what I
> want to happen. However, quorum is 3, not 2, so why didn't all 4 nodes
> reboot?
Because, if the `ha-manager status` output still mirrors the same setup
(i.e. the same services configured on the same nodes) as when the network
failure happened, I see that just one node has active services running.
We do not fence nodes which have no configured HA services, or whose
configured HA services are all disabled.
We think that this would just lower reliability for non-HA services while
bringing no increase in reliability for HA services.
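For reference, you can check this on the command line (both commands ship
with pve-ha-manager):

  # list the configured HA resources
  ha-manager config
  # show the CRM master and each node's LRM state ('active' vs 'idle')
  ha-manager status

A node whose LRM reports 'idle' has no active HA services, so its watchdog
is not armed and it will not self-fence when quorum is lost.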
>
> # pvecm status
> Quorum information
> ------------------
> Date: Tue May 2 09:35:23 2017
> Quorum provider: corosync_votequorum
> Nodes: 4
> Node ID: 0x00000001
> Ring ID: 4/524
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 4
> Highest expected: 4
> Total votes: 4
> Quorum: 3
> Flags: Quorate
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 0x00000004 1 192.168.0.11
> 0x00000003 1 192.168.0.203
> 0x00000001 1 192.168.0.204 (local)
> 0x00000002 1 192.168.0.206
>
> # ha-manager status
> quorum OK
> master node3 (active, Tue May 2 09:35:24 2017)
> lrm node1 (idle, Tue May 2 09:35:27 2017)
> lrm node2 (active, Tue May 2 09:35:26 2017)
> lrm node3 (idle, Tue May 2 09:35:23 2017)
> lrm node3 (idle, Tue May 2 09:35:23 2017)
>
> Somehow proxmox was smart enough to keep two of the nodes online, but
> with a quorum of 3 neither group should have had quorum. How does it
> decide which group to keep online?
see above
Cheers,
Thomas