[pve-devel] Two-Node HA

Thu Sep 29 09:53:19 CEST 2016

Hello,
another option for 2-node HA would be what HA-Lizard for XenServer does.
Basically, it tests whether some external IPs (e.g. storage/switches) can
be reached to ensure quorum/majority in the two-node setup. This may not
be the best solution, but it is way better than running into split-brain.
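
A minimal sketch of that idea in Python, to make it concrete. This is not
HA-Lizard's actual code: the function names, the witness list and the use
of Linux `ping -c/-W` are my own assumptions for illustration. A node only
considers itself quorate if it reaches a strict majority of the witnesses.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if one ICMP echo succeeds (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def has_witness_majority(
    witnesses: Iterable[str],
    is_reachable: Callable[[str], bool] = ping_once,
) -> bool:
    """Hypothetical helper: True if a strict majority of witnesses answers."""
    hosts = list(witnesses)
    if not hosts:
        # No witnesses configured: never claim quorum.
        return False
    reachable = sum(1 for host in hosts if is_reachable(host))
    # Strict majority, so a 1-of-2 result counts as a tie and fails.
    return reachable * 2 > len(hosts)
```

A node would call has_witness_majority(["<storage ip>", "<switch ip>", ...])
before starting recovery. Both nodes can still reach the witnesses while
not reaching each other, so this check does not replace fencing.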

Alex

On 29.09.16 at 08:09, Thomas Lamprecht wrote:
> Hi Andreas,
>
>
> On 09/28/2016 05:53 PM, Andreas Steinel wrote:
>> Hi Thomas,
>>
>> Thank you for your time and your answer.
>>
>> I wonder why e.g. an Oracle Real Application Cluster (RAC) works so
>> well with 2 nodes in an HA setup. We deployed 50+ clusters in the last
>> years and never had a split-brain-like situation. Rolling updates, as
>> well as occasional host crashes, are also possible without losing data
>> - sometimes even sessions. If you use Transparent Application Failover
>> (TAF), your database sessions will be migrated to the other node,
>> rolled back and restarted (of course application support is required
>> on the "client" side). It's not perfect, but it works most of the
>> time. We had only a few total crashes, mainly due to storage issues,
>> but also due to some bugs in the cluster stack.
>
> A bit of a lengthy explanation below why this comparison may not work,
> IMHO.
>
> Oracle RAC and Proxmox VE do different things: one is an application
> with quasi fail-silent characteristics running on the application level,
> the other is an operating system running on bare metal, with byzantine
> errors possible.
>
> With RAC you serve clients. If a client does not reach you, it asks
> another server; if you're dead, you sync up when starting again. You are
> a closed system which knows what runs inside and how the other server
> reacts if something happens. But if the communication between the
> cluster nodes is broken, but not between clients and nodes, and two
> clients write to the same dataset, each on another server and each with
> other data, you will also get problems: a merge conflict. In certain
> situations you can solve it. Databases often have it simpler here, as
> they can just say the newer entry "wins" and the older one is out of
> date and would have been overwritten nonetheless, so I guess RAC can
> utilize this.
> But what do you do if two VMs write to the same block on a shared
> storage? The block can represent a different thing for each VM, so a
> decision without manual intervention is in general impossible here.
>
> I mean, our cluster filesystem can also work like this and has never had
> (known) split brains, even in two-node clusters when one node failed and
> the other was set to have quorum. There we have a (relatively) small
> task to solve and thus more possibilities with fewer possible errors, as
> we have less to think about. So it's not that we are in general not able
> to do such things, but there are different limitations when doing
> different things. :)
>
> Proxmox VE serves virtual guest systems and effectively knows nothing
> about them, so it has a harder time ensuring that, if it recovers, it
> really recovers and does not cause more corruption than recovery. Also
> there is shared access to resources: storage, as already mentioned above,
> or IP address collisions, ...
> So as "third level" disaster recovery (first being application level,
> second hardware level) we need stricter rules to follow: we need fencing
> and we need to ensure that we are not a failed node ourselves, thus we
> need quorum. And quorum between two nodes will get you a tie in the case
> of a failure.
>
> In a lot of cases you could buy three slightly smaller machines instead
> of two heavy ones: more redundancy, better load balancing possible, real
> HA possible. But yes, it may not be suitable in every situation - I
> understand that.
> Also, you need three nonetheless, 2 PVE + shared storage, so another
> possibility would be removing the shared storage node (which probably is
> a single point of failure one way or the other and surely not cheap) and
> using three nodes with a decentralized storage technology: ceph, gluster,
> sheepdog, ...
>
> So nothing against two-node clusters, those are really great for a lot
> of people, but if someone really wants HA then those are not enough.
> Simply having three nodes is not enough either; redundancy then has to
> happen at all levels: power supplies, network, shared storage ...
>
>>
>> Nevertheless, it's very good to see that a simple third vote solution
>> is on the horizon, which could be easily integrated on an RPi or an
>> even less "powerhungry" machine.
>
> I would not mark the RPi as "powerhungry" :D But yes, it's a cool idea in
> general.
>
> cheers,
> Thomas
>
>> Best,
>> Andreas
>>
>> On Wed, Sep 28, 2016 at 3:46 PM, Thomas Lamprecht
>> <t.lamprecht at proxmox.com> wrote:
>>> Hi,
>>>
>>> QDisks are not ideal and will probably not be supported by Proxmox VE.
>>> Also, I would really love to see the term "two node HA" vanish, as it's
>>> only marketing talk and is technically simply not possible (sadly, basic
>>> rules of our universe make it impossible). They call a setup with three
>>> voters (the two nodes + the storage node) two node HA to sound better...
>>>
>>> That said, rant aside, there are plans to add the corosync (our cluster
>>> communication stack) QDevice daemon, which then allows qdevices (at the
>>> moment there is only QNetd) to provide votes for one or more clusters.
>>>
>>> This QNetd device may run on a non-Proxmox VE node and uses TCP/IP to
>>> communicate with the cluster.
>>>
>>> So you can have a two-node cluster, set up the qdevice daemon there and
>>> the qnetd daemon on your storage box, which then provides the third
>>> vote needed to allow recovery on a failure of one of the two Proxmox VE
>>> nodes.
>>>
>>> Patches for this are already on the list, what's mainly missing is -
>>> obviously - reviewing them and documenting all of this (which I'm doing
>>> atm).
>>>
>>> cheers,
>>> Thomas
>>>
>>>
>>>
>>> On 09/28/2016 03:26 PM, Andreas Steinel wrote:
>>>> Hi,
>>>>
>>>> I'd like to ask if there are any plans to use e.g. the shared storage
>>>> as a quorum/voting disk, like the Oracle Grid Infrastructure has used
>>>> it to get a two-node HA cluster (for almost a decade). This obviously
>>>> only works for NAS or SAN storage.
>>>>
>>>> Best,
>>>> Andreas
>>>> _______________________________________________
>>>> pve-devel mailing list
>>>> pve-devel at pve.proxmox.com
>>>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel