[pve-devel] Two-Node HA

Thomas Lamprecht t.lamprecht at proxmox.com
Thu Sep 29 08:09:00 CEST 2016


Hi Andreas,


On 09/28/2016 05:53 PM, Andreas Steinel wrote:
> Hi Thomas,
>
> Thank you for your time and your answer.
>
> I wonder why e.g. an Oracle Real Application Cluster (RAC) works so
> well with a 2-node HA setup. We deployed 50+ clusters in recent years
> and never had a split-brain-like situation. Rolling updates, as well
> as occasional host crashes, are also possible without losing data -
> sometimes even sessions. If you use Transparent Application Failover
> (TAF), your database sessions will be migrated to the other node,
> rolled back and restarted (application support is of course required
> on the "client" side). It's not perfect, but it works most of the
> time. We had only a few total crashes, mainly due to storage issues,
> but also due to some bugs in the cluster stack.

A bit of a lengthy explanation below of why this comparison may not
work, IMHO.

Oracle RAC and Proxmox VE do different things: one is an application with
quasi fail-silent characteristics running at the application level, the
other is an operating system running on bare metal, where byzantine errors
are possible.

With RAC you serve clients: if a client cannot reach you, it asks another
server, and if you die you sync up when starting again. You are a closed
system which knows what runs inside it and how the other server reacts if
something happens. But if the communication between the cluster nodes
breaks while both still reach their clients, and two clients write to the
same dataset, each on a different server and each with different data, you
also get a problem: a merge conflict. In certain situations you can solve
it; databases often have it simpler here, as they can just say the newer
entry "wins" and the older one is out of date and would have been
overwritten anyway, so I guess RAC can utilize this.
But what do you do if two VMs write to the same block on a shared storage?
The block can represent a different thing for each VM, so a decision
without manual intervention is here in general impossible.
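To make the difference concrete, here is a minimal sketch (hypothetical, not RAC's actual algorithm) of why a database can merge automatically while a hypervisor cannot:

```python
from typing import Optional

def merge_rows(a: dict, b: dict) -> dict:
    """Last-writer-wins: a database can compare timestamps and let the
    newer entry win, since the older one would have been overwritten
    anyway."""
    return a if a["ts"] >= b["ts"] else b

def merge_blocks(a: bytes, b: bytes) -> Optional[bytes]:
    """Two VMs wrote different data to the same raw block: without
    knowing what that block means to each guest, there is no safe
    automatic choice."""
    if a == b:
        return a
    return None  # undecidable without manual intervention

print(merge_rows({"ts": 2, "val": "new"}, {"ts": 1, "val": "old"})["val"])  # new
print(merge_blocks(b"\x00", b"\xff"))  # None
```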

I mean, our cluster filesystem can also work like this and has never had a
(known) split brain, even in two-node clusters where one node failed and
the other was set to have quorum. We have a (relatively) small task to
solve and thus more options with fewer possible errors, as there is less
to think about. So it's not that we are in general unable to do such
things; there are just different limitations when doing different
things. :)

Proxmox VE serves virtual guest systems and effectively knows nothing
about them, so it has a harder time ensuring that when it recovers, it
really recovers and does not cause more corruption than it repairs. There
is also shared access to resources: storage, as already mentioned above,
IP address collisions, ...
So as "third level" disaster recovery (the first being the application
level, the second the hardware level) we need stricter rules to follow: we
need fencing, and we need to ensure that we are not a failed node
ourselves, thus we need quorum.
And quorum between two nodes will get you a tie in the case of a failure.
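The arithmetic behind that tie is simple; a sketch of the basic majority rule (not votequorum's actual implementation):

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    # Strict majority: more than half of all expected votes.
    return votes_present > total_votes // 2

# Two-node cluster: one node fails, the survivor holds 1 of 2 votes.
print(has_quorum(1, 2))  # False - a tie, no quorum, no recovery
# Add a third vote (a third node or a qdevice): the surviving side
# holds 2 of 3 votes and may recover.
print(has_quorum(2, 3))  # True
```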

In a lot of cases you could buy three somewhat smaller machines instead of
two heavy ones: more redundancy, better load balancing, real HA possible.
But yes, it may not be suitable in every situation - I understand that.
Also, you need three machines nonetheless: 2 PVE nodes + shared storage.
So one possibility would be to remove the shared storage node (which is
probably a single point of failure one way or the other, and surely not
cheap) and use three nodes with a decentralized storage technology: Ceph,
Gluster, Sheepdog, ...

So nothing against two-node clusters - they are really great for a lot of
people. But if someone really wants HA, two nodes are not enough; even
three nodes alone are not enough, as redundancy then has to happen at all
levels: power supplies, network, shared storage, ...

>
> Nevertheless, it's very good to see that a simple third vote solution
> is on the horizon, which could easily be integrated on an RPi or an
> even less "powerhungry" machine.

I would not call the RPi "powerhungry" :D But yes, it's a cool idea in
general.
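For the curious, once the QDevice support lands, the third vote would be wired up in the quorum section of corosync.conf, roughly like this (a sketch based on corosync-qdevice's "net" model; the host name is a placeholder for whatever box runs QNetd):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            host: qnetd-host.example.com
            algorithm: ffsplit
        }
    }
}
```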

cheers,
Thomas

> Best,
> Andreas
>
> On Wed, Sep 28, 2016 at 3:46 PM, Thomas Lamprecht
> <t.lamprecht at proxmox.com> wrote:
>> Hi,
>>
>> QDisks are not ideal and will probably not be supported by Proxmox VE
>> themselves. Also, I would really love to see the term "two node HA"
>> vanish, as it's only marketing talk and is technically simply not
>> possible (sadly, basic rules of our universe make it impossible). They
>> call a setup with three voters (the two nodes + the storage node) "two
>> node HA" to sound better...
>>
>> That said, rant aside: there are plans to add the corosync (our cluster
>> communication stack) QDevice daemon, which then allows qdevices (at the
>> moment there is only QNetd) to provide votes for one or more clusters.
>>
>> This QNetd device may run on a non-Proxmox VE node and uses TCP/IP to
>> communicate with the cluster.
>>
>> So you can have a two-node cluster, set up the qdevice daemon there,
>> and run the qnetd daemon on your storage box, which then provides the
>> third vote needed to allow recovery after a failure of one of the two
>> Proxmox VE nodes.
>>
>> Patches for this are already on the list; what's mainly missing is -
>> obviously - reviewing them and documenting it all (which I'm doing
>> atm).
>>
>> cheers,
>> Thomas
>>
>>
>>
>> On 09/28/2016 03:26 PM, Andreas Steinel wrote:
>>> Hi,
>>>
>>> I'd like to ask if there are any plans to use e.g. the shared storage
>>> as a quorum/voting disk, like Oracle Grid Infrastructure uses it to
>>> get a two-node HA cluster (and has for almost a decade). This
>>> obviously only works for NAS or SAN storage.
>>>
>>> Best,
>>> Andreas
>>> _______________________________________________
>>> pve-devel mailing list
>>> pve-devel at pve.proxmox.com
>>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel




