[pve-devel] [PATCH docs 2/2] Update pvecm documentation for corosync 3

Stefan Reiter s.reiter at proxmox.com
Tue Jul 9 10:20:21 CEST 2019


Thanks for the feedback!

Regarding patch 1/2: I grep'd through the sources and could not find any 
references to the heading names I changed. A quick look through the GUI 
also didn't reveal any obvious references.
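
For reference, roughly what I checked (just a sketch; OLD-HEADING-NAME stands 
in for each changed heading, and the repository paths are only examples):

  grep -rn "OLD-HEADING-NAME" pve-docs/ pve-manager/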

Some of my own notes inline, I will send v2 today.

On 7/9/19 9:19 AM, Thomas Lamprecht wrote:
> On 7/8/19 6:26 PM, Stefan Reiter wrote:
>> Parts about multicast and RRP have been removed entirely. Instead, a new
>> section 'Corosync Redundancy' has been added explaining the concept of
>> links and link priorities.
>>
> 
> not bad at all, still some notes inline.
> 
>> Signed-off-by: Stefan Reiter <s.reiter at proxmox.com>
>> ---
>>   pvecm.adoc | 372 +++++++++++++++++++++--------------------------------
>>   1 file changed, 147 insertions(+), 225 deletions(-)
>>
>> diff --git a/pvecm.adoc b/pvecm.adoc
>> index 1c0b9e7..1246111 100644
>> --- a/pvecm.adoc
>> +++ b/pvecm.adoc
>> @@ -56,13 +56,8 @@ Grouping nodes into a cluster has the following advantages:
>>   Requirements
>>   ------------
>>   
>> -* All nodes must be in the same network as `corosync` uses IP Multicast
>> - to communicate between nodes (also see
>> - http://www.corosync.org[Corosync Cluster Engine]). Corosync uses UDP
>> - ports 5404 and 5405 for cluster communication.
>> -+
>> -NOTE: Some switches do not support IP multicast by default and must be
>> -manually enabled first.
>> +* All nodes must be able to contact each other via UDP ports 5404 and 5405 for
>> + corosync to work.
>>   
>>   * Date and time have to be synchronized.
>>   
>> @@ -84,6 +79,11 @@ NOTE: While it's possible for {pve} 4.4 and {pve} 5.0 this is not supported as
>>   production configuration and should only used temporarily during upgrading the
>>   whole cluster from one to another major version.
>>   
>> +NOTE: Mixing {pve} 6.x and earlier versions is not supported, because of the
>> +major corosync upgrade. While possible to run corosync 3 on {pve} 5.4, this
>> +configuration is not supported for production environments and should only be
>> +used while upgrading a cluster.
>> +
>>   
>>   Preparing Nodes
>>   ---------------
>> @@ -96,10 +96,12 @@ Currently the cluster creation can either be done on the console (login via
>>   `ssh`) or the API, which we have a GUI implementation for (__Datacenter ->
>>   Cluster__).
>>   
>> -While it's often common use to reference all other nodenames in `/etc/hosts`
>> -with their IP this is not strictly necessary for a cluster, which normally uses
>> -multicast, to work. It maybe useful as you then can connect from one node to
>> -the other with SSH through the easier to remember node name.
>> +While it's common to reference all nodenames and their IPs in `/etc/hosts` (or
>> +make their names resolvable through other means), this is not strictly
>> +necessary for a cluster to work. It may be useful however, as you can then
>> +connect from one node to the other with SSH via the easier to remember node
>> +name. (see also xref:pvecm_corosync_addresses[Link Address Types])
>> +
>>   
>>   [[pvecm_create_cluster]]
>>   Create the Cluster
>> @@ -113,31 +115,12 @@ node names.
>>    hp1# pvecm create CLUSTERNAME
>>   ----
>>   
>> -CAUTION: The cluster name is used to compute the default multicast address.
>> -Please use unique cluster names if you run more than one cluster inside your
>> -network. To avoid human confusion, it is also recommended to choose different
>> -names even if clusters do not share the cluster network.
> 
> Maybe move this from a "CAUTION" to a "NOTE" and keep the hint that it still
> makes sense to use unique cluster names, to avoid human confusion and as I have
> a feeling that there are other assumptions in corosync which depend on that.
> Also, _if_ multicast gets integrated into knet we probably have a similar issue
> again, so try to bring people in lane now already, even if not 100% required.
> 

Makes sense. I just wanted to avoid mentioning multicast in the general 
instructions, so that people reading the docs for the first time aren't 
confused about whether they need it or not.
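
Maybe something along these lines for v2 (rough draft, keeping the hint but 
dropping the multicast part):

  NOTE: It is still a good idea to use unique cluster names, even if the 
  clusters do not share a network - it avoids human confusion, and other 
  parts of corosync may rely on the name being unique.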

>> -
>>   To check the state of your cluster use:
>>   
>>   ----
>>    hp1# pvecm status
>>   ----
>>   
>> -Multiple Clusters In Same Network
>> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> -
>> -It is possible to create multiple clusters in the same physical or logical
>> -network. Each cluster must have a unique name, which is used to generate the
>> -cluster's multicast group address. As long as no duplicate cluster names are
>> -configured in one network segment, the different clusters won't interfere with
>> -each other.
>> -
>> -If multiple clusters operate in a single network it may be beneficial to setup
>> -an IGMP querier and enable IGMP Snooping in said network. This may reduce the
>> -load of the network significantly because multicast packets are only delivered
>> -to endpoints of the respective member nodes.
>> -
> 
> It's still possible to create multiple clusters in the same network, so I'd keep
> above and just adapt to non-multicast for now..
> 
>>   
>>   [[pvecm_join_node_to_cluster]]
>>   Adding Nodes to the Cluster
>> @@ -150,7 +133,7 @@ Login via `ssh` to the node you want to add.
>>   ----
>>   
>>   For `IP-ADDRESS-CLUSTER` use the IP or hostname of an existing cluster node.
>> -An IP address is recommended (see xref:pvecm_corosync_addresses[Ring Address Types]).
>> +An IP address is recommended (see xref:pvecm_corosync_addresses[Link Address Types]).
> 
> Maybe somewhere a note that while the new things are named "Link" the config
> still refers to "ringX_addr" for backward compatibility.
> 
>>   
>>   CAUTION: A new node cannot hold any VMs, because you would get
>>   conflicts about identical VM IDs. Also, all existing configuration in
>> @@ -173,7 +156,7 @@ Date:             Mon Apr 20 12:30:13 2015
>>   Quorum provider:  corosync_votequorum
>>   Nodes:            4
>>   Node ID:          0x00000001
>> -Ring ID:          1928
>> +Ring ID:          1/8
>>   Quorate:          Yes
>>   
>>   Votequorum information
>> @@ -217,15 +200,15 @@ Adding Nodes With Separated Cluster Network
>>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>   
>>   When adding a node to a cluster with a separated cluster network you need to
>> -use the 'ringX_addr' parameters to set the nodes address on those networks:
>> +use the 'link0' parameter to set the node's address on that network:
>>   
>>   [source,bash]
>>   ----
>> -pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0
>> +pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0
>>   ----
>>   
>> -If you want to use the Redundant Ring Protocol you will also want to pass the
>> -'ring1_addr' parameter.
>> +If you want to use the built-in xref:pvecm_redundancy[redundancy] of the
>> +kronosnet transport you can also pass the 'link1' parameter.
>>   
>>   
>>   Remove a Cluster Node
>> @@ -283,7 +266,7 @@ Date:             Mon Apr 20 12:44:28 2015
>>   Quorum provider:  corosync_votequorum
>>   Nodes:            3
>>   Node ID:          0x00000001
>> -Ring ID:          1992
>> +Ring ID:          1/8
>>   Quorate:          Yes
>>   
>>   Votequorum information
>> @@ -302,8 +285,8 @@ Membership information
>>   0x00000003          1 192.168.15.92
>>   ----
>>   
>> -If, for whatever reason, you want that this server joins the same
>> -cluster again, you have to
>> +If, for whatever reason, you want this server to join the same cluster again,
>> +you have to
>>   
>>   * reinstall {pve} on it from scratch
>>   
>> @@ -329,14 +312,14 @@ storage with another cluster, as storage locking doesn't work over cluster
>>   boundary. Further, it may also lead to VMID conflicts.
>>   
>>   Its suggested that you create a new storage where only the node which you want
>> -to separate has access. This can be an new export on your NFS or a new Ceph
>> +to separate has access. This can be a new export on your NFS or a new Ceph
>>   pool, to name a few examples. Its just important that the exact same storage
>>   does not gets accessed by multiple clusters. After setting this storage up move
>>   all data from the node and its VMs to it. Then you are ready to separate the
>>   node from the cluster.
>>   
>>   WARNING: Ensure all shared resources are cleanly separated! You will run into
>> -conflicts and problems else.
>> +conflicts and problems otherwise.
>>   
>>   First stop the corosync and the pve-cluster services on the node:
>>   [source,bash]
>> @@ -400,6 +383,7 @@ the nodes can still connect to each other with public key authentication. This
>>   should be fixed by removing the respective keys from the
>>   '/etc/pve/priv/authorized_keys' file.
>>   
>> +
>>   Quorum
>>   ------
>>   
>> @@ -419,12 +403,13 @@ if it loses quorum.
>>   
>>   NOTE: {pve} assigns a single vote to each node by default.
>>   
>> +
>>   Cluster Network
>>   ---------------
>>   
>>   The cluster network is the core of a cluster. All messages sent over it have to
>> -be delivered reliable to all nodes in their respective order. In {pve} this
>> -part is done by corosync, an implementation of a high performance low overhead
>> +be delivered reliably to all nodes in their respective order. In {pve} this
>> +part is done by corosync, an implementation of a high performance, low overhead
>>   high availability development toolkit. It serves our decentralized
>>   configuration file system (`pmxcfs`).
>>   
>> @@ -432,75 +417,59 @@ configuration file system (`pmxcfs`).
>>   Network Requirements
>>   ~~~~~~~~~~~~~~~~~~~~
>>   This needs a reliable network with latencies under 2 milliseconds (LAN
>> -performance) to work properly. While corosync can also use unicast for
>> -communication between nodes its **highly recommended** to have a multicast
>> -capable network. The network should not be used heavily by other members,
>> -ideally corosync runs on its own network.
>> -*never* share it with network where storage communicates too.
>> +performance) to work properly. The network should not be used heavily by other
>> +members, ideally corosync runs on its own network. Do not use a shared network
>> +for corosync and storage (except as a potential low-priority fallback in a
>> +xref:pvecm_redundancy[redundant] configuration).
>>   
>>   Before setting up a cluster it is good practice to check if the network is fit
>> -for that purpose.
>> +for that purpose. With corosync 3, it is enough to ensure all nodes can reach
>> +each other over the interfaces you are planning to use. Using `ping` is enough
>> +for a basic test.
>>   
>> -* Ensure that all nodes are in the same subnet. This must only be true for the
>> -  network interfaces used for cluster communication (corosync).
>> +If the {pve} firewall is enabled, ACCEPT rules for corosync will automatically
>> +be generated - no manual action is required.
> 
> "will automatically be generated" vs "will be automatically generated"?
> 

or "will be generated automatically" ;)

I feel like all of these are correct though.
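
Side note: if someone wants to double-check the generated rules on a node, 
something along these lines should work (assuming the firewall is actually 
enabled; the exact output format may differ):

  pve-firewall compile | grep -E '5404|5405'
  iptables-save | grep -E '5404|5405'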

>>   
>> -* Ensure all nodes can reach each other over those interfaces, using `ping` is
>> -  enough for a basic test.
>> +NOTE: Corosync used Multicast before version 3.0 (introduced in {pve} 6.0).
>> +Modern versions rely on https://kronosnet.org/[Kronosnet] for cluster
>> +communication, which uses regular UDP unicast.
> 
> "... which, for now, only supports regular UDP unicast."?
> 
>>   
>> -* Ensure that multicast works in general and a high package rates. This can be
>> -  done with the `omping` tool. The final "%loss" number should be < 1%.
>> -+
>> -[source,bash]
>> -----
>> -omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...
>> -----
>> -
>> -* Ensure that multicast communication works over an extended period of time.
>> -  This uncovers problems where IGMP snooping is activated on the network but
>> -  no multicast querier is active. This test has a duration of around 10
>> -  minutes.
>> -+
>> -[source,bash]
>> -----
>> -omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
>> -----
>> -
>> -Your network is not ready for clustering if any of these test fails. Recheck
>> -your network configuration. Especially switches are notorious for having
>> -multicast disabled by default or IGMP snooping enabled with no IGMP querier
>> -active.
>> -
>> -In smaller cluster its also an option to use unicast if you really cannot get
>> -multicast to work.
>> +CAUTION: You can still enable Multicast or legacy unicast by setting your
>> +transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf],
>> +but keep in mind that this will disable all cryptography and redundancy support.
>> +This is therefore not recommended.
> 
> off-topic: what I general see as a loss are the omping checks, they could be used
> to get connection and latencies stats from all of the cluster nodes easily, that was
> nice to get a feeling of the whole network...
> 
> 
> 

Is it still good to leave this out then? Should I mention it somewhere?

It felt a bit misleading though, since it requires multicast and may not 
work on networks where corosync is just fine.
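
If we want to keep a basic latency check, maybe the network requirements 
section could suggest a plain ping round instead, e.g. (rough sketch, the 
IPs are placeholders):

  for node in 192.168.15.91 192.168.15.92 192.168.15.93; do
      ping -c 100 -i 0.2 -q "$node"
  done

That still gives min/avg/max latency per link, it just cannot say anything 
about multicast.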

>>   
>>   Separate Cluster Network
>>   ~~~~~~~~~~~~~~~~~~~~~~~~
>>   
>>   When creating a cluster without any parameters the cluster network is generally
>> -shared with the Web UI and the VMs and its traffic. Depending on your setup
>> +shared with the Web UI and the VMs and their traffic. Depending on your setup,
>>   even storage traffic may get sent over the same network. Its recommended to
>>   change that, as corosync is a time critical real time application.
>>   
>> +NOTE: Setting up corosync links on a different network does not affect other
>> +cluster communications (e.g. Web UI, default migration network, etc...).
>> +
> 
> This note is a bit confusing, IMO, what to you want to tell here?
> 

Couldn't find a better way to word it, but it's something that confused me 
a lot personally when reading the doc for the first time. Basically, it 
means that if you change the corosync network, it will not affect other 
cluster communication.

We use the term "cluster network" a lot, and it took me a while to 
realize that this "cluster network" is completely unrelated to 
corosync. It makes sense if you know it, but it confused me just looking 
at the doc.

I'll find a way to reword it or take it out for v2.
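
Maybe something like (rough draft): "NOTE: Changing the network corosync runs 
on does not affect other cluster communication (e.g. Web UI, the default 
migration network); those are configured separately."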

> snip
> 

Rest makes perfect sense, will be fixed.



