[pve-devel] Two-Node HA

Thu Sep 29 13:33:55 CEST 2016

Thank you, Thomas, I really like this discussion.

On Thu, Sep 29, 2016 at 8:09 AM, Thomas Lamprecht
<t.lamprecht at proxmox.com> wrote:
> [...]
>
> The ORAC and a Proxmox VE do different stuff, one is an application with
> quasi fail-silent characteristics running on the application level, the
> other is an operating system running on bare metal, with byzantine errors
> possible.

I'm comparing at the level of Oracle Grid Infrastructure versus corosync,
and there I do not see much difference.
LXC and QEMU are also applications (with kernel modules, but the
Oracle Grid Infrastructure has those as well).
Am I missing something in the picture, or is this too oversimplified?

> With RAC you serve clients, if a client does not reach you he asks another
> server, if you're dead you sync up when starting again, you are a closed
> system which knows what runs inside and how the other server reacts if
> something happens, but if the communication between clusters are broken but
> not between clients and two clients write on the same dataset, each on
> another server, each with other data you will also get problems, a merge
> conflict, in certain situation you can solve it, databases are here often
> simpler as they can just say the newer entry "wins" and the older is out to
> date and would have been over written nonetheless, so I guess here RAC can
> utilize this.
> But what do you do if two VMs write on the same block on a shared storage,
> the block can for each VM represent a different thing, a decision without
> manual intervention is here in general impossible.

I'm just referring to the Grid Infrastructure, which sits below the
RAC-enabled database (in a tiered environment) and primarily handles
resources, including a shared storage system with RAID capability
and a cluster file system on top of that shared storage. It
communicates through an interconnect network and uses the network as
well as an exchange over the storage to ensure that both nodes can see
each other. Whenever a node loses communication to the storage backend,
it will crash the cluster stack and everything running on top of it. If
the nodes "only" lose the network-based interconnect, they will
communicate over the storage backend and then crash according to their
decision. This works pretty well, and the HA stack automatically
migrates resources on switch-over or fail-over and also on recovery.
All this has nothing to do with the database itself, which only runs
as a resource under a different OS user on the HA stack. I'm not
deeply familiar with corosync, but its main purpose is to provide
resources in an HA manner, isn't it?
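
To make the behaviour I described a bit more concrete, here is a rough
sketch of the per-node decision as I understand it. This is only my
simplified reading, not Oracle's actual algorithm; the names `Action`
and `decide` are made up for illustration.

```python
from enum import Enum


class Action(Enum):
    STAY_UP = "stay up"
    ARBITRATE_VIA_STORAGE = "arbitrate over the storage backend"
    CRASH_STACK = "crash the cluster stack"


def decide(storage_ok: bool, interconnect_ok: bool) -> Action:
    """Hypothetical per-node decision, simplified from the description above."""
    if not storage_ok:
        # Losing the storage backend is fatal for the node's cluster stack,
        # regardless of the state of the interconnect.
        return Action.CRASH_STACK
    if not interconnect_ok:
        # Only the network interconnect is gone: the nodes still reach each
        # other through the shared storage and agree there on who crashes.
        return Action.ARBITRATE_VIA_STORAGE
    return Action.STAY_UP


for storage_ok in (True, False):
    for interconnect_ok in (True, False):
        print(storage_ok, interconnect_ok, decide(storage_ok, interconnect_ok).value)
```

The point is only that the storage path acts as a second communication
channel, so a lost interconnect alone does not have to end in a tie.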

> As Proxmox VE serves Virtual Guest systems and effectively knows nothing
> about them and has a harder time ensuring that if it recovers it really
> recovers and does not cause more corruption than recovery.  Also there is
> shared access to resources, storage as already mentioned above, or IP
> address collisions, ...

All this is also part of the grid infrastructure, even the IP stuff.

> So as "third level" disaster recovery (first being application level, second
> hardware level) we need stricter rules to follow, we need fencing and we
> need to ensure that we are not a failed node itself, thus we need quorum.
> And quorum between two nodes will get you a tie in the case of a failure.

Is this still a problem if you can communicate over storage (e.g. real
shared storage)? Maybe that's a corner case, because normally you
would only have "real" shared storage (not software-based) if you
have a SAN, in which case you have a lot of money and will probably
buy at least 3 nodes anyway.
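
What I have in mind is the usual majority arithmetic, sketched below.
The extra "storage vote" is purely an assumption to illustrate my
question; I am not claiming that corosync or Proxmox VE offer it in
this form, and `has_quorum` is a made-up helper.

```python
def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """A partition is quorate only with a strict majority of all votes."""
    return votes_seen > total_votes // 2


# Two nodes with one vote each: after a split, each side sees only itself.
two_node_split = [has_quorum(1, 2), has_quorum(1, 2)]

# Assumption: the shared storage carries a third, tie-breaking vote that
# exactly one node can claim after the split.
with_storage_vote = [has_quorum(2, 3), has_quorum(1, 3)]

print(two_node_split)     # neither side has a majority
print(with_storage_vote)  # exactly one side has a majority
```

With only two votes neither side can ever reach a majority on its own,
which is the tie you describe; a third vote reachable over the storage
would break it.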

There is also a market for "cluster-in-a-box" products (e.g. Fujitsu
PRIMEFLEX), such as a 2U chassis with two servers and shared storage
(ordinary dual-channel drives normally used in HA-SANs). It would be
great to be able to sell these, especially with Proxmox VE in an HA setup.

I remember the old days with STONITH devices on cheap hardware; that also worked.

> Also you need three nonetheless, 2 PVE + shared storage, so a possibility
> would be also removing the shared storage node (which probably is a single
> point of failure one way or the other and surely not cheap) and use three
> nodes with a decentralized storage technology, ceph, gluster, sheepdog, ...

In the cluster environments we usually operate, we always have the "not
cheap" ones based on FC-SAN. It's a prerequisite for Oracle RAC, and
RAC itself is not that cheap either.

> So nothing against two node clusters, those are really great for a lot of
> people but if someone wants really HA then those are not enough, also
> simply three nodes are not enough, redundancy has to happen at all levels
> then: power supplies, network, shared storage ...

Yes, that's clear.