[pve-devel] [PATCH docs v2 1/2] pvesr: update the chapter and bring it into good condition
Alexander Zeidler
a.zeidler at proxmox.com
Wed Dec 18 17:19:47 CET 2024
* restructure and revise the introduction
* add subchapter "Recommendations"
* remove the subchapter "Schedule Format" with its one line of content
and link where appropriate directly to the copy under "25. Appendix D:
Calendar Events". The help button for adding/editing a job now links
to the subchapter "Managing Jobs".
* provide details on job removal and how to enforce it if necessary
* add more helpful CLI examples and improve existing ones
* restructure and revise the subchapter "Error Handling"
Signed-off-by: Alexander Zeidler <a.zeidler at proxmox.com>
---
v2:
* no changes, only add missing pve-manager patch
pvecm.adoc | 2 +
pvesr.adoc | 402 ++++++++++++++++++++++++++++++++++++-----------------
2 files changed, 279 insertions(+), 125 deletions(-)
diff --git a/pvecm.adoc b/pvecm.adoc
index 15dda4e..4028e92 100644
--- a/pvecm.adoc
+++ b/pvecm.adoc
@@ -486,6 +486,7 @@ authentication. You should fix this by removing the respective keys from the
'/etc/pve/priv/authorized_keys' file.
+[[pvecm_quorum]]
Quorum
------
@@ -963,6 +964,7 @@ case $- in
esac
----
+[[pvecm_external_vote]]
Corosync External Vote Support
------------------------------
diff --git a/pvesr.adoc b/pvesr.adoc
index 9ad02f5..de29240 100644
--- a/pvesr.adoc
+++ b/pvesr.adoc
@@ -24,48 +24,65 @@ Storage Replication
:pve-toplevel:
endif::manvolnum[]
-The `pvesr` command-line tool manages the {PVE} storage replication
-framework. Storage replication brings redundancy for guests using
-local storage and reduces migration time.
-
-It replicates guest volumes to another node so that all data is available
-without using shared storage. Replication uses snapshots to minimize traffic
-sent over the network. Therefore, new data is sent only incrementally after
-the initial full sync. In the case of a node failure, your guest data is
-still available on the replicated node.
-
-The replication is done automatically in configurable intervals.
-The minimum replication interval is one minute, and the maximal interval
-once a week. The format used to specify those intervals is a subset of
-`systemd` calendar events, see
-xref:pvesr_schedule_time_format[Schedule Format] section:
-
-It is possible to replicate a guest to multiple target nodes,
-but not twice to the same target node.
-
-Each replications bandwidth can be limited, to avoid overloading a storage
-or server.
-
-Only changes since the last replication (so-called `deltas`) need to be
-transferred if the guest is migrated to a node to which it already is
-replicated. This reduces the time needed significantly. The replication
-direction automatically switches if you migrate a guest to the replication
-target node.
-
-For example: VM100 is currently on `nodeA` and gets replicated to `nodeB`.
-You migrate it to `nodeB`, so now it gets automatically replicated back from
-`nodeB` to `nodeA`.
-
-If you migrate to a node where the guest is not replicated, the whole disk
-data must send over. After the migration, the replication job continues to
-replicate this guest to the configured nodes.
+Storage replication is particularly interesting for small clusters
+whose guest volumes are placed on local storage instead of a shared
+one. By replicating the volumes to other cluster nodes, guest
+migration to those nodes becomes significantly faster.
+
+In the event of a node or local storage failure, the volume data as of
+the latest completed replication run is still available on the
+replication target nodes.
[IMPORTANT]
====
-High-Availability is allowed in combination with storage replication, but there
-may be some data loss between the last synced time and the time a node failed.
+A replication-enabled guest can be configured for
+xref:chapter_ha_manager[high availability] or
+xref:pvesr_node_failed[moved manually] while its origin node is
+unavailable. Before doing either, read about the involved
+xref:pvesr_risk_of_data_loss[risk of data loss] and how to avoid it.
====
+.Replication requires …
+
+* at least one other cluster node as a replication target
+* a common local storage entry in the datacenter that is functional on
+both the source and the target node
+* a local storage type that is
+xref:pvesr_supported_storage[supported by replication]
+* guest volumes that are stored on that local storage
+
+.Replication …
+
+* allows fast migration to nodes that the guest is replicated to
+* provides guest volume redundancy in clusters where using a shared
+storage type is not an option
+* is configured as one job per replication target, so multiple jobs
+enable multiple target nodes
+* runs its jobs one after the other at their configured interval (the
+shortest being every minute)
+* uses snapshots to regularly transmit only changed volume data
+(so-called deltas)
+* can be bandwidth-limited per job, which also smooths the storage and
+network utilization
+* keeps the targets basically the same when the guest is migrated to
+another node
+* reverses the direction of a job when the guest is moved to that
+job's configured replication target
+
+.Example:
+
+A guest runs on node `A` and has replication jobs to nodes `B` and
+`C`, both with a set interval of every five minutes (`*/5`). Now we
+migrate the guest from `A` to `B`, which also automatically updates
+the replication targets for this guest to be `A` and `C`. The
+migration completes quickly, as only the volume data changed since the
+last replication run has to be transmitted.
+
+In the event that node `B` or its local storage fails, the guest can
+be restarted on `A` or `C`, with the risk of some data loss as
+described in this chapter.
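+
+For illustration, the two replication jobs from this example could be
+created via CLI roughly as follows (a sketch assuming the guest has
+the ID `100` and the nodes are literally named `A`, `B` and `C`):
+
+----
+# pvesr create-local-job 100-0 B --schedule "*/5"
+# pvesr create-local-job 100-1 C --schedule "*/5"
+----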
+
+[[pvesr_supported_storage]]
Supported Storage Types
-----------------------
@@ -76,147 +93,282 @@ Supported Storage Types
|ZFS (local) |zfspool |yes |yes
|=============================================
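+
+A matching storage entry in `/etc/pve/storage.cfg` could, as a rough
+sketch, look like this (assuming a ZFS pool `rpool/data` and the
+storage ID `local-zfs`, made available on the nodes `pve1` and
+`pve2`):
+
+----
+zfspool: local-zfs
+        pool rpool/data
+        content images,rootdir
+        nodes pve1,pve2
+----
+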
-[[pvesr_schedule_time_format]]
-Schedule Format
+[[pvesr_recommendations]]
+Recommendations
---------------
-Replication uses xref:chapter_calendar_events[calendar events] for
-configuring the schedule.
-Error Handling
---------------
+[[pvesr_risk_of_data_loss]]
+Risk of Data Loss
+~~~~~~~~~~~~~~~~~
+
+If a node suddenly becomes unavailable for a longer period of time, it
+may become necessary to run a guest on a replication target node
+instead. The guest will then use the latest replicated volume data
+available on the chosen target node. That volume state will also be
+replicated to other nodes with the next replication runs, since the
+replication directions of the related jobs are updated automatically.
+This also means that the newer volume state still present on the
+failed node will be removed once that node becomes available again.
+
+A more resilient solution may be to use a shared
+xref:chapter_storage[storage type] instead. If that is not an option,
+consider keeping the replication job intervals short and avoid moving
+replication-configured guests while their origin node is unavailable.
+Instead of configuring those guests for high availability,
+xref:qm_startup_and_shutdown[start at boot] may be a sufficient
+alternative.
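+
+For example, start at boot could be enabled for a VM via CLI like this
+(a minimal sketch, assuming the VMID `100`):
+
+----
+# qm set 100 --onboot 1
+----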
+
+[[pvesr_replication_network]]
+Network for Replication Traffic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Replication traffic is routed via the
+xref:pvecm_migration_network[migration network]. If it is not set, the
+management network is used by default, which can have a negative
+impact on corosync and therefore on cluster availability. To specify
+the migration network, navigate to
+__Datacenter -> Options -> Migration Settings__, or set it via CLI in
+the xref:datacenter_configuration_file[`datacenter.cfg`].
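+
+A corresponding entry in `datacenter.cfg` could, for example, look
+roughly like this (assuming the dedicated network `10.1.2.0/24`):
+
+----
+migration: secure,network=10.1.2.0/24
+----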
+
+[[pvesr_cluster_size]]
+Cluster Size
+~~~~~~~~~~~~
+
+With a 2-node cluster in particular, the failure of one node can leave
+the other node without a xref:pvecm_quorum[quorum]. In order to keep
+the cluster functional at all times, it is therefore crucial to
+xref:pvecm_join_node_to_cluster[expand] to a 3-node cluster in advance
+or to configure a xref:pvecm_external_vote[QDevice] for the third
+vote.
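+
+As a rough sketch, after installing the `corosync-qnetd` package on an
+external host and the `corosync-qdevice` package on all cluster nodes
+(see the linked section for details), the QDevice could be set up like
+this, assuming the external host is reachable at `192.168.22.90`:
+
+----
+# pvecm qdevice setup 192.168.22.90
+----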
+
+[[pvesr_managing_jobs]]
+Managing Jobs
+-------------
-If a replication job encounters problems, it is placed in an error state.
-In this state, the configured replication intervals get suspended
-temporarily. The failed replication is repeatedly tried again in a
-30 minute interval.
-Once this succeeds, the original schedule gets activated again.
+[thumbnail="screenshot/gui-qemu-add-replication-job.png"]
-Possible issues
-~~~~~~~~~~~~~~~
+Replication jobs can easily be created, modified and removed via the
+web interface, or by using the CLI tool `pvesr`.
-Some of the most common issues are in the following list. Depending on your
-setup there may be another cause.
+To manage all replication jobs in one place, go to
+__Datacenter -> Replication__. Additional functionality is available
+under __Node -> Replication__ and __Guest -> Replication__. Go there
+to view logs, trigger a one-time run of a job, or benefit from
+prefilled fields when configuring a job.
-* Network is not working.
+Enabled replication jobs automatically run at their set interval, one
+after the other. The default interval is every quarter of an hour
+(`*/15`) and can be set to as often as every minute (`*/1`), see the
+xref:chapter_calendar_events[schedule format].
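+
+A few further example schedules, written as calendar events (a sketch;
+see the linked appendix for the full syntax):
+
+----
+*/30            every 30 minutes
+22:00           daily at 22:00
+mon..fri 05:00  Monday to Friday at 05:00
+----
+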
-* No free space left on the replication target storage.
+Optionally, the network bandwidth can be limited, which also helps to
+keep the storage load on the target node acceptable.
-* Storage with the same storage ID is not available on the target node.
+Shortly after job creation, a first snapshot is taken and sent to the
+target node. Subsequent snapshots are taken at the set interval, and
+only the volume data modified since the previous snapshot is sent,
+allowing for a significantly shorter transfer time.
-NOTE: You can always use the replication log to find out what is causing the problem.
+If you remove a replication job, the snapshots on the target node are
+also deleted by default. The removal takes place at the next possible
+point in time and requires the job to be enabled. If the target node
+is permanently unreachable, the cleanup can be skipped by forcing the
+job deletion via CLI.
-Migrating a guest in case of Error
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-// FIXME: move this to better fitting chapter (sysadmin ?) and only link to
-// it here
+When not using the web interface, the cluster-wide unique replication
+job ID has to be specified, for example `100-0`. It is composed of the
+guest ID, a hyphen and an arbitrary job number.
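+
+Job configurations are stored in `/etc/pve/replication.cfg`. As a
+rough sketch, an entry created for such a job might look similar to
+the following (illustrative only, assuming the target node `pve2`, a
+five-minute schedule and a rate limit of 10 MB/s):
+
+----
+local: 100-0
+        target pve2
+        schedule */5
+        rate 10
+----
+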
-In the case of a grave error, a virtual guest may get stuck on a failed
-node. You then need to move it manually to a working node again.
+[[pvesr_cli_examples]]
+CLI Examples
+------------
-Example
-~~~~~~~
+Create a replication job for guest `100` and give it the job number
+`0`. Replicate to node `pve2` every five minutes (`*/5`), at a maximum
+network bandwidth of `10` MBps (megabytes per second).
-Let's assume that you have two guests (VM 100 and CT 200) running on node A
-and replicate to node B.
-Node A failed and can not get back online. Now you have to migrate the guest
-to Node B manually.
+----
+# pvesr create-local-job 100-0 pve2 --schedule "*/5" --rate 10
+----
-- connect to node B over ssh or open its shell via the web UI
+List replication jobs from all nodes.
-- check if that the cluster is quorate
-+
----
-# pvecm status
+# pvesr list
----
-- If you have no quorum, we strongly advise to fix this first and make the
- node operable again. Only if this is not possible at the moment, you may
- use the following command to enforce quorum on the current node:
-+
+List the job statuses from all local guests, or only from a specific
+local guest.
+
----
-# pvecm expected 1
+# pvesr status [--guest 100]
----
-WARNING: Avoid changes which affect the cluster if `expected votes` are set
-(for example adding/removing nodes, storages, virtual guests) at all costs.
-Only use it to get vital guests up and running again or to resolve the quorum
-issue itself.
+Read the configuration of job `100-0`.
-- move both guest configuration files form the origin node A to node B:
-+
----
-# mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
-# mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
+# pvesr read 100-0
----
-- Now you can start the guests again:
-+
+Update the configuration of job `100-0`, for example, to change the
+schedule interval to every full hour.
+
----
-# qm start 100
-# pct start 200
+# pvesr update 100-0 --schedule "*/00"
----
-Remember to replace the VMIDs and node names with your respective values.
+To run the job `100-0` once as soon as possible, regardless of its
+configured interval, schedule it now.
-Managing Jobs
--------------
+----
+# pvesr schedule-now 100-0
+----
-[thumbnail="screenshot/gui-qemu-add-replication-job.png"]
+Disable (or `enable`) the job `100-0`.
+
+----
+# pvesr disable 100-0
+----
+
+Delete the job `100-0`. If the target node is permanently unreachable,
+`--force` can be used to skip the failing cleanup.
-You can use the web GUI to create, modify, and remove replication jobs
-easily. Additionally, the command-line interface (CLI) tool `pvesr` can be
-used to do this.
+----
+# pvesr delete 100-0 [--force]
+----
-You can find the replication panel on all levels (datacenter, node, virtual
-guest) in the web GUI. They differ in which jobs get shown:
-all, node- or guest-specific jobs.
+[[pvesr_error_handling]]
+Error Handling
+--------------
-When adding a new job, you need to specify the guest if not already selected
-as well as the target node. The replication
-xref:pvesr_schedule_time_format[schedule] can be set if the default of `all
-15 minutes` is not desired. You may impose a rate-limit on a replication
-job. The rate limit can help to keep the load on the storage acceptable.
+[[pvesr_job_failed]]
+Job Failed
+~~~~~~~~~~
-A replication job is identified by a cluster-wide unique ID. This ID is
-composed of the VMID in addition to a job number.
-This ID must only be specified manually if the CLI tool is used.
+In the event that a replication job fails, it is temporarily placed in
+an error state and a notification is sent. A retry is scheduled 5
+minutes later, followed by further retries after 10 and 15 minutes,
+and finally every 30 minutes. As soon as the job runs successfully
+again, it leaves the error state and the configured interval is
+resumed.
-Network
--------
+.Troubleshooting Job Failures
-Replication traffic will use the same network as the live guest migration. By
-default, this is the management network. To use a different network for the
-migration, configure the `Migration Network` in the web interface under
-`Datacenter -> Options -> Migration Settings` or in the `datacenter.cfg`. See
-xref:pvecm_migration_network[Migration Network] for more details.
+To find out exactly why a job failed, read its log, available under
+__Node -> Replication__.
-Command-line Interface Examples
--------------------------------
+Common causes are:
-Create a replication job which runs every 5 minutes with a limited bandwidth
-of 10 Mbps (megabytes per second) for the guest with ID 100.
+* The network is not working properly.
+* The storage in use has an availability restriction set that excludes
+the target node.
+* The storage is not set up correctly on the target node (e.g., a
+different pool name).
+* The storage on the target node has no free space left.
+
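+To check the last two causes, you can, for example, inspect the
+storage status and free space on the target node via CLI (assuming
+the storage ID `local-zfs`):
+
+----
+# pvesm status --storage local-zfs
+----
+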
+[[pvesr_node_failed]]
+Origin Node Failed
+~~~~~~~~~~~~~~~~~~
+// FIXME: move this to better fitting chapter (sysadmin ?) and only link to
+// it here
+In the event that a node running replicated guests fails suddenly and
+stays down for too long, it may become necessary to restart these
+guests on their replication target nodes. Guests which are configured
+for high availability (HA) are recovered on other nodes automatically;
+just wait, keeping in mind the involved
+xref:pvesr_risk_of_data_loss[risk of data loss]. Guests which are not
+configured for HA can be moved manually as explained below, with the
+same risk of data loss.
+
+[[pvesr_find_latest_replicas]]
+.Step 1: Optionally Decide on a Specific Replication Target Node
+
+To minimize the data loss of an important guest, you can find the
+target node on which the most recent successful replication took
+place. If the origin node is still healthy enough to access its web
+interface, go to __Node -> Replication__ and check the 'Last Sync'
+column. Alternatively, you can carry out the following steps.
+
+. To list all target nodes of an important guest, in this example
+with the ID `1000`, go to the CLI of any node and run:
++
----
-# pvesr create-local-job 100-0 pve1 --schedule "*/5" --rate 10
+# pvesr list | grep -e Job -e ^1000
----
-Disable an active job with ID `100-0`.
+. Open the CLI on all listed target nodes.
+. Adapt the following command to your VMID to find the most recent
+snapshots among your target nodes. If snapshots were taken within the
+same minute, look for the highest number at the end of the snapshot
+name, which is a Unix timestamp.
++
----
-# pvesr disable 100-0
+# zfs list -t snapshot -o name,creation | grep -e -1000-disk
----
-Enable a deactivated job with ID `100-0`.
+[[pvesr_verify_cluster_health]]
+.Step 2: Verify Cluster Health
+
+Go to the CLI of any replication target node and run `pvecm status`.
+If the output contains `Quorate: Yes`, then the cluster/corosync is
+healthy enough and you can proceed with
+xref:pvesr_move_a_guest[Step 3: Move a guest].
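+
+For a quick check, the relevant line can also be filtered out, for
+example:
+
+----
+# pvecm status | grep Quorate
+----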
+
+WARNING: If the cluster is not quorate and consists of 3 or more
+nodes/votes, we strongly recommend solving the underlying problem
+first, so that at least the majority of nodes/votes is available
+again.
+
+If the cluster is not quorate and consists of only 2 nodes without an
+additional xref:pvecm_external_vote[QDevice], you may want to proceed
+with the following steps to temporarily make the cluster functional
+again.
+
+. Check whether the expected votes are `2`.
++
----
-# pvesr enable 100-0
+# pvecm status | grep votes
----
-Change the schedule interval of the job with ID `100-0` to once per hour.
+. Now you can enforce quorum on the one remaining node by running:
++
+----
+# pvecm expected 1
+----
++
+WARNING: Avoid making changes to the cluster in this state at all
+costs, for example adding or removing nodes, storages or guests. Delay
+such changes until the second node is available again and the expected
+votes have been automatically restored to `2`.
+
+[[pvesr_move_a_guest]]
+.Step 3: Move a Guest
+. Use SSH to connect to any node that is part of the cluster majority.
+Alternatively, go to the web interface and open the shell of such a
+node in a separate window or browser tab.
++
+. The following example commands move the VM `1000` and the container
+`2000` from the node named `pve-failed` to a still available
+replication target node named `pve-replicated`.
++
+----
+# cd /etc/pve/nodes/
+# mv pve-failed/qemu-server/1000.conf pve-replicated/qemu-server/
+# mv pve-failed/lxc/2000.conf pve-replicated/lxc/
+----
++
+. Now you can start those guests again:
++
----
-# pvesr update 100-0 --schedule '*/00'
+# qm start 1000
+# pct start 2000
----
++
+. If it was necessary to enforce quorum, as described in
+xref:pvesr_verify_cluster_health[Step 2: Verify Cluster Health], do
+not forget the warning there about avoiding changes to the cluster.
ifdef::manvolnum[]
include::pve-copyright.adoc[]
--
2.39.5