[pve-devel] [PATCH v2 docs] pvecm.adoc: qdevice: Adapt, update and make it clearer

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Mar 25 15:05:38 CET 2020


On 3/25/20 2:36 PM, Aaron Lauterer wrote:
> Naming the whole mechanism and one of the daemons the same makes it easy
> to mix up the two. This patch aims to make the whole understanding of
> the QDevice and its parts easier by
> 
> * describing use cases at the beginning
> * making the distinction between the QDevice mechanism and qdevice
> daemon clear
> * adding one more item to the FAQ section to troubleshoot
> * fix small grammer and sentence structures.
> * phrasing some parts in a simpler fashion
> 
> Signed-off-by: Aaron Lauterer <a.lauterer at proxmox.com>
> ---
> v1 -> v2:
> * remove troubleshooting for failed corosync-qdevice start
>    this should be fixed with the repackaged corosync-qdevice.deb
> * added the section about adding a qdevice to an odd sized cluster and
>   tried to make the whole (N-1) explanation easier to understand.
> 
> I had this patch in the pipeline for a while and finally got around to
> fix it up.
> 
> Feedback regarding the understandability, grammer and spelling mistakes
> is welcome
> 
>  pvecm.adoc | 171 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 92 insertions(+), 79 deletions(-)
> 
> diff --git a/pvecm.adoc b/pvecm.adoc
> index f65f94d..e1fe42d 100644
> --- a/pvecm.adoc
> +++ b/pvecm.adoc
> @@ -872,107 +872,113 @@ If you see a healthy cluster state, it means that your new link is being used.
>  Corosync External Vote Support
>  ------------------------------
>  
> -This section describes a way to deploy an external voter in a {pve} cluster.
> -When configured, the cluster can sustain more node failures without
> -violating safety properties of the cluster communication.
> +It is possible to add an external voter to a {pve} cluster. This enables a
> +cluster to suffer more node failures without losing quorum. There are two
> +prominent use cases.
> +
> +The first are small two node clusters. If one node fails, the remaining node

Maybe:
"The first use case is small two-node clusters."

> +cannot know if the other host is really down and HA guests need to be started,
> +or if the cluster communication is lost and a so called 'split brain'
> +footnote:[https://en.wikipedia.org/wiki/Split-brain_(computing)] situation has
> +occurred.

No, this is *NOT* a split brain, as corosync exists exactly to provide protection
against a split brain, by losing quorum on both nodes in such a case.

> +Adding an external voting device can help mitigate such a situation.
> +
> +The second use case are larger clusters with an even number of nodes. In case of
> +problems with the cluster communication it is possible to have two partitions
> +with the same number of nodes, a 'split brain' situation. Adding an external

same here with 'split brain'

> +voting device to tip the number of possible votes to an odd number ensures that
> +there will always be a majority in one of the partitions.

I'm really not sure about these paragraphs; from a technical standpoint they add
confusion by presenting the same use case as if it were two. The two paragraphs
are semantically exactly the same, just that the first is a specific case of the
second. IMO, this adds no value, it just confuses more.

> +
> +QDevice technical overview
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> -For this to work there are two services involved:
> +Two parts form the 'QDevice' mechanism:
>  
> -* a so called qdevice daemon which runs on each {pve} node
> +* The `qdevice` daemon runs on each {pve} node.

`corosync-qdevice`, if we want to allow using it 1:1 for systemctl stuff or the like.

>  
> -* an external vote daemon which runs on an independent server.
> +* The `qnetd` daemon runs on the external, independent server. It can deal with
> +multiple clusters.

`corosync-qnetd`

but note that there was some reasoning behind not mentioning qnetd: theoretically it
is just one implementation, as the system is designed generically and there could be
arbitrary vote arbitrators (heh). It can make sense to mention qnetd the way you do,
because practically it's the only really usable (and from us supported) one.

>  
> -As a result you can achieve higher availability even in smaller setups (for
> -example 2+1 nodes).

Note, our prominent setup to solve is the 2 (PVE) + 1 (storage) box one; I want to
have that highlighted somehow.

> +The `qdevice` and `qnetd` daemons use TCP/IP for their communication. Low
> +latency is not such a big issue as with corosync itself. This means that the
> +`qnetd` service can even be placed outside of the clusters LAN.
>  

why switch from the bullet points to a normal paragraph here? It could also be a
bullet point.

> -QDevice Technical Overview
> -~~~~~~~~~~~~~~~~~~~~~~~~~~
> +The 'QDevice' shows up as its own device with it's own vote in the cluster when
> +`pvecm status` is run. In case of a partitioned cluster the `qnetd` daemon
> +decides which partition gets the 'QDevice' vote. All nodes in a partition must
> +be reachable from the `qnetd` service on the external server to get the vote. At
> +any time only one partition of a cluster gets the vote.
>  
> -The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster
> -node. It provides a configured number of votes to the clusters quorum
> -subsystem based on an external running third-party arbitrator's decision.
> -Its primary use is to allow a cluster to sustain more node failures than
> -standard quorum rules allow. This can be done safely as the external device
> -can see all nodes and thus choose only one set of nodes to give its vote.
> -This will only be done if said set of nodes can have quorum (again) when
> -receiving the third-party vote.
> -
> -Currently only 'QDevice Net' is supported as a third-party arbitrator. It is
> -a daemon which provides a vote to a cluster partition if it can reach the
> -partition members over the network. It will give only votes to one partition
> -of a cluster at any time.

IMO, some of the information goes missing in your rewrite, or gets blurred; I'd like
to avoid that in technical documentation.

> -It's designed to support multiple clusters and is almost configuration and
> -state free. New clusters are handled dynamically and no configuration file
> -is needed on the host running a QDevice.
> -
> -The external host has the only requirement that it needs network access to the
> -cluster and a corosync-qnetd package available. We provide such a package
> -for Debian based hosts, other Linux distributions should also have a package
> -available through their respective package manager.
> -
> -NOTE: In contrast to corosync itself, a QDevice connects to the cluster over
> -TCP/IP. The daemon may even run outside of the clusters LAN and can have longer
> -latencies than 2 ms.
> +NOTE: The naming of 'QDevice', the mechanism, and the `qdevice` daemon can be
> +confusing at times. The `qdevice` daemon is the service running on each {pve}
> +node and in combination with the `qnetd` daemon running on an external machine
> +forms the 'QDevice' mechanism.
>  
>  Supported Setups
>  ~~~~~~~~~~~~~~~~
>  
> -We support QDevices for clusters with an even number of nodes and recommend
> -it for 2 node clusters, if they should provide higher availability.
> -For clusters with an odd node count we discourage the use of QDevices
> -currently. The reason for this, is the difference of the votes the QDevice
> -provides for each cluster type. Even numbered clusters get single additional
> -vote, with this we can only increase availability, i.e. if the QDevice
> -itself fails we are in the same situation as with no QDevice at all.
> -
> -Now, with an odd numbered cluster size the QDevice provides '(N-1)' votes --
> -where 'N' corresponds to the cluster node count. This difference makes
> -sense, if we had only one additional vote the cluster can get into a split
> -brain situation.
> -This algorithm would allow that all nodes but one (and naturally the
> -QDevice itself) could fail.
> -There are two drawbacks with this:
> -
> -* If the QNet daemon itself fails, no other node may fail or the cluster
> -  immediately loses quorum.  For example, in a cluster with 15 nodes 7
> -  could fail before the cluster becomes inquorate. But, if a QDevice is
> -  configured here and said QDevice fails itself **no single node** of
> -  the 15 may fail. The QDevice acts almost as a single point of failure in
> -  this case.
> -
> -* The fact that all but one node plus QDevice may fail sound promising at
> -  first, but this may result in a mass recovery of HA services that would
> -  overload the single node left. Also ceph server will stop to provide
> -  services after only '((N-1)/2)' nodes are online.
> -
> -If you understand the drawbacks and implications you can decide yourself if
> -you should use this technology in an odd numbered cluster setup.
> +Cluster with even node count
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +{pve} supports 'QDevices' for clusters with an even number of nodes to add one
> +additional vote and to avoid a possible 'split brain' situation.
> +
> +In a two node cluster an additional vote will allow it to stay operational if
> +one of the two nodes is down, making it possible to provide high availability.
> +
> +Cluster with odd node count
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^

this is now a sub-heading of "Supported Setups", but it is _not_ supported.

> +
> +Adding a 'QDevice' to an odd numbered cluster is discouraged. It can forced but
> +you should be aware of the implications and drawbacks.
> +
> +The algorithm is changed to Last Man Standing, 'LMS', when a 'QDevice' is
> +forcefully added to a cluster with an odd node count. The 'QDevice' gets '(N-1)'
> +votes --  where 'N' corresponds to the cluster node count.
> +
> +This avoids having an even number of votes in the cluster, which could lead to a
> +'split brain' situation. All nodes except one can fail. In combination with the
> +votes from the 'QDevice' the one node will still form a quorate cluster.
> +
> +Two drawbacks with this are:
> +
> +* If the 'qnetd' daemon on the external server fails, no other may can fail or
> +  the cluster will lose quorum.
> +  For example, in a normal cluster of 15 nodes, 7 nodes can fail before the
> +  cluster loses quorum. With a 'QDevice' added and it fails, **no single node**
> +  of the 15 can fail or the cluster loses quorum. The 'QDevice' acts almost as a
> +  single point of failure in this scenario.
> +
> +* The fact, that all but one node plus the 'QDevice' can fail may sound
> +  promising at first, but this can result in a mass recovery of HA guests that
> +  can overload the one left node. If Ceph is used, the whole cluster will have a
> +  problem a lot earlier once enough nodes for a full Ceph failure stopped
> +  working.
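
Maybe the vote arithmetic behind this drawback is worth spelling out. A hypothetical
illustration (not an actual pvecm command, just the numbers for an assumed 15-node
cluster):

```shell
# LMS vote arithmetic for an odd cluster of assumed size N=15
N=15                          # nodes, one vote each
QDEV=$((N - 1))               # QDevice votes under last-man-standing: 14
TOTAL=$((N + QDEV))           # 29 votes in total
QUORUM=$((TOTAL / 2 + 1))     # majority needed: 15 votes

# QDevice alive: 1 node (1 vote) + QDevice (14 votes) = 15 -> still quorate,
# so up to N-1 = 14 nodes may fail.
# QDevice dead: all 15 node votes are needed to reach 15 -> no node may fail.
echo "total=$TOTAL quorum=$QUORUM"
```

Compare that with the plain 15-node cluster, where quorum is 8 and 7 nodes may fail.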
>  
>  QDevice-Net Setup
>  ~~~~~~~~~~~~~~~~~
>  
>  We recommend to run any daemon which provides votes to corosync-qdevice as an
> -unprivileged user. {pve} and Debian provides a package which is already
> +unprivileged user. {pve} and Debian provide a package which is already
>  configured to do so.
>  The traffic between the daemon and the cluster must be encrypted to ensure a
>  safe and secure QDevice integration in {pve}.
>  
> -First install the 'corosync-qnetd' package on your external server and
> +First install the 'corosync-qnetd' package on the external server and
>  the 'corosync-qdevice' package on all cluster nodes.
>  
> -After that, ensure that all your nodes on the cluster are online.
> +Next ensure that all nodes nodes in the cluster are online an can ping the

s/an can/and can/

> +external server.
>  
> -You can now easily set up your QDevice by running the following command on one
> -of the {pve} nodes:
> +To set up the QDevice run the the following command on one of the {pve} nodes:
>  
>  ----
>  pve# pvecm qdevice setup <QDEVICE-IP>
>  ----
>  
> -The SSH key from the cluster will be automatically copied to the QDevice. You
> -might need to enter an SSH password during this step.
> +The SSH key from the cluster will be automatically copied to the external
> +server. You might need to enter an SSH password at this step.
>  
> -After you enter the password and all the steps are successfully completed, you
> +After the password is entered and all the steps are successfully completed, you
>  will see "Done". You can check the status now:
>  
>  ----
> @@ -1006,8 +1012,8 @@ Tie Breaking
>  ^^^^^^^^^^^^
>  
>  In case of a tie, where two same-sized cluster partitions cannot see each other
> -but the QDevice, the QDevice chooses randomly one of those partitions and
> -provides a vote to it.
> +but the QDevice, the QDevice will randomly choose one of the partitions and
> +provide a vote to it.
>  
>  Possible Negative Implications
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> @@ -1020,20 +1026,27 @@ Adding/Deleting Nodes After QDevice Setup
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  If you want to add a new node or remove an existing one from a cluster with a
> -QDevice setup, you need to remove the QDevice first. After that, you can add or
> -remove nodes normally. Once you have a cluster with an even node count again,
> -you can set up the QDevice again as described above.
> +QDevice set up, you need to remove the QDevice first. After that, you can add or
> +remove nodes normally. You can set up the QDecive again should the cluster have
> +an even node count after the changes.
>  
>  Removing the QDevice
>  ^^^^^^^^^^^^^^^^^^^^
>  
>  If you used the official `pvecm` tool to add the QDevice, you can remove it
> -trivially by running:
> +by running:
>  
>  ----
>  pve# pvecm qdevice remove
>  ----
>  
> +SSH Password is not accepted
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +In case that the external server does not accept the password during the setup
> +phase, make sure that the SSH daemon on the external server is configured to
> +allow the root login with password.
> +
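
Maybe also worth showing the relevant sshd option here. Something like (assuming a
stock Debian sshd_config on the external server, and that password login is only
enabled temporarily for the setup):

```
# /etc/ssh/sshd_config on the external server
PermitRootLogin yes
```
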
>  //Still TODO
>  //^^^^^^^^^^
>  //There is still stuff to add here
> 