[pve-devel] [PATCH docs] initial documentation for qdevice

Oguz Bektas o.bektas at proxmox.com
Mon Mar 4 14:15:13 CET 2019

Authored by: Thomas Lamprecht <t.lamprecht at proxmox.com>
Co-Authored by: Oguz Bektas <o.bektas at proxmox.com>
Signed-off-by: Oguz Bektas <o.bektas at proxmox.com>

This is still WIP, hence leaving the TODO at the end.

 pvecm.adoc | 150 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/pvecm.adoc b/pvecm.adoc
index 0384d81..0144d11 100644
--- a/pvecm.adoc
+++ b/pvecm.adoc
@@ -753,6 +753,156 @@ If you cannot reboot the whole cluster ensure no High Availability services are
 configured and the stop the corosync service on all nodes. After corosync is
 stopped on all nodes start it one after the other again.
+Corosync External Vote Support
+This section describes a way to deploy an external voter in a {pve} cluster.
+When configured, the cluster can sustain more node failures without
+violating safety properties of the cluster communication.
+For this to work there are two services involved:
+* a so-called QDevice daemon which runs on each {pve} node
+* an external vote daemon which runs on an independent server.
+As a result you can achieve higher availability even in smaller setups (for
+example 2+1 nodes).
+QDevice Technical Overview
+The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
+node. It provides a configured number of votes to the cluster's quorum
+subsystem, based on the decision of an externally running third-party
+arbitrator.
+Its primary use is to allow a cluster to sustain more node failures than
+standard quorum rules allow. This can be done safely as the external device
+can see all nodes and thus choose only one set of nodes to give its vote.
+This will only be done if said set of nodes can quorate (again) when
+receiving the third-party vote.
+Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
+a daemon which provides a vote to a cluster partition if it can reach the
+partition members over the network. It will only give votes to one partition
+of a cluster at any time.
+It's designed to support multiple clusters and is almost configuration and
+state free. New clusters are handled dynamically and no configuration file
+is needed on the host running a QDevice.
+The only requirements for the external host are that it has network access to
+the cluster and that a corosync-qnetd package is available for it. We provide
+such a package for Debian based hosts; other Linux distributions should also
+have a package available through their respective package manager.
+NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
+TCP/IP and thus does not need a multicast capable network between itself and
+the cluster. In fact the daemon may run outside of the LAN and can have
+longer latencies than 2 ms.
+Supported Setups
+We support QDevices for clusters with an even number of nodes and recommend
+them for 2 node clusters, if they should provide higher availability.
+For clusters with an odd node count, we currently discourage the use of
+QDevices. The reason for this is the difference in the number of votes the
+QDevice provides for each cluster type. Even numbered clusters get a single
+additional vote. With this we can only increase availability, i.e. if the
+QDevice itself fails, we are in the same situation as with no QDevice at all.
+With an odd numbered cluster size, on the other hand, the QDevice provides
+'(N-1)' votes, where 'N' corresponds to the cluster node count. This
+difference makes sense: if we had only one additional vote, the cluster could
+get into a split brain situation.
+This algorithm would allow all nodes but one (and naturally the QDevice
+itself) to fail.
+There are two drawbacks with this:
+* If the QNet daemon itself fails, no other node may fail or the cluster
+  immediately loses quorum. For example, in a cluster with 15 nodes, 7
+  could fail before the cluster becomes inquorate. But if a QDevice is
+  configured here and the QDevice itself fails, **no single node** of
+  the 15 may fail. The QDevice acts almost as a single point of failure in
+  this case.
+* The fact that all nodes except one (plus the QDevice) may fail sounds
+  promising at first, but this may result in a mass recovery of HA services
+  that would overload the single node left. Also, a Ceph server will stop
+  providing services if only '((N-1)/2)' nodes or fewer remain online.
+If you understand the drawbacks and implications you can decide yourself if
+you should use this technology in an odd numbered cluster setup.
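The vote arithmetic described in this section can be checked with a minimal Python sketch. It assumes one vote per node and the usual majority rule (quorum is half of the expected votes, rounded down, plus one). The function names are illustrative only and are not part of any {pve} or corosync tooling.

```python
def qdevice_votes(nodes: int) -> int:
    """Votes the QDevice contributes: 1 for even clusters, N-1 for odd ones."""
    return 1 if nodes % 2 == 0 else nodes - 1


def quorum(expected_votes: int) -> int:
    """Majority rule (assumption): more than half of the expected votes."""
    return expected_votes // 2 + 1


def tolerable_node_failures(nodes: int, use_qdevice: bool,
                            qdevice_alive: bool = True) -> int:
    """How many nodes may fail before the remaining ones lose quorum."""
    extra = qdevice_votes(nodes) if use_qdevice else 0
    needed = quorum(nodes + extra)
    # The QDevice votes only count while the QDevice itself is reachable.
    granted = extra if (use_qdevice and qdevice_alive) else 0
    min_nodes_online = max(needed - granted, 0)
    return nodes - min_nodes_online


if __name__ == "__main__":
    for n in (2, 4, 15):
        print(n,
              tolerable_node_failures(n, use_qdevice=False),
              tolerable_node_failures(n, use_qdevice=True),
              tolerable_node_failures(n, use_qdevice=True, qdevice_alive=False))
```

For the 15 node example, this reproduces the numbers from the text: 7 tolerable failures without a QDevice, 14 with a working QDevice, and 0 once the QDevice itself is down.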
+QDevice-Net Setup
+We recommend running any daemon which provides votes to corosync-qdevice as an
+unprivileged user. {pve} and Debian Stretch provide a package which is
+already configured to do so.
+The traffic between the daemon and the cluster must be encrypted to ensure a
+safe and secure QDevice integration in {pve}.
+First install the 'corosync-qnetd' package on your external server and
+the 'corosync-qdevice' package on all cluster nodes.
+After that, ensure that all your nodes on the cluster are online.
+You can now easily set up your QDevice by running the following command on one
+of the {pve} nodes:
+pve# pvecm qdevice setup <QDEVICE-IP>
+The SSH key from the cluster will be automatically copied to the QDevice. You
+might need to enter an SSH password during this step.
+After you enter the password and all the steps are successfully completed, you
+will see "Done". You can check the status now:
+pve# pvecm status
+Votequorum information
+Expected votes:   3
+Highest expected: 3
+Total votes:      3
+Quorum:           2
+Flags:            Quorate Qdevice
+Membership information
+    Nodeid      Votes    Qdevice Name
+    0x00000001          1    A,V,NMW (local)
+    0x00000002          1    A,V,NMW
+    0x00000000          1            Qdevice
+which means the QDevice is set up.
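If this check should be scripted, the status text can be inspected for the `Qdevice` flag. The following Python sketch assumes the output layout shown above (one `Key: value` pair per line in the votequorum part). The helper names are hypothetical, and the sample text is copied from the example output rather than captured from a live cluster.

```python
SAMPLE_STATUS = """\
Votequorum information
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice
Membership information
    Nodeid      Votes    Qdevice Name
    0x00000001          1    A,V,NMW (local)
    0x00000002          1    A,V,NMW
    0x00000000          1            Qdevice
"""


def parse_votequorum(status_text: str) -> dict:
    """Collect 'Key: value' lines into a dict; lines without a colon are skipped."""
    info = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def qdevice_is_active(status_text: str) -> bool:
    """True if the Flags line lists 'Qdevice' (assumed layout, see above)."""
    flags = parse_votequorum(status_text).get("Flags", "").split()
    return "Qdevice" in flags


if __name__ == "__main__":
    print(parse_votequorum(SAMPLE_STATUS))
    print(qdevice_is_active(SAMPLE_STATUS))
```

On a real node the text would come from running `pvecm status`. The parser only relies on the `Flags:` line, so additional lines in the output are ignored.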
+Frequently Asked Questions
+Tie Breaking
+In case of a tie, where two same-sized cluster partitions cannot see each
+other but can both still see the QDevice, the QDevice randomly chooses one of
+those partitions and provides a vote to it.
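The behaviour described here can be illustrated with a small Python sketch. It is a simplified model based only on the description in this section (one vote per node, a vote is only given to a partition that becomes quorate with it, ties are broken randomly). It is not the actual corosync-qnetd algorithm, and `choose_partition` is a hypothetical name.

```python
import random


def choose_partition(partitions, cluster_size, qdevice_votes=1, rng=random):
    """Pick the single partition that receives the QDevice vote, or None.

    partitions:   lists of node names the arbitrator can currently reach
    cluster_size: total number of nodes configured in the cluster
    """
    needed = (cluster_size + qdevice_votes) // 2 + 1
    # Only partitions that would be quorate with the extra vote are eligible.
    candidates = [p for p in partitions if len(p) + qdevice_votes >= needed]
    if not candidates:
        return None
    largest = max(len(p) for p in candidates)
    tied = [p for p in candidates if len(p) == largest]
    # Same-sized partitions: choose one at random, as described in the text.
    return rng.choice(tied)


if __name__ == "__main__":
    split = [["node1", "node2"], ["node3", "node4"]]
    print(choose_partition(split, cluster_size=4))
```

In the 2+2 split above, either partition may be printed, but never both, so only one side of the cluster stays quorate.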
+Still TODO
+There is still stuff to add here.
 Corosync Configuration
