[pve-devel] [PATCH docs v3 1/2] pvesr: update the chapter and bring it into good condition
Fiona Ebner
f.ebner at proxmox.com
Thu Feb 13 15:56:21 CET 2025
Reads pretty well all in all. Maybe I was a bit nitpicky in the initial
section, but better to suggest too much than too little.
Am 10.01.25 um 17:58 schrieb Alexander Zeidler:
> diff --git a/pvecm.adoc b/pvecm.adoc
> index 15dda4e..4028e92 100644
> --- a/pvecm.adoc
> +++ b/pvecm.adoc
> @@ -486,6 +486,7 @@ authentication. You should fix this by removing the respective keys from the
> '/etc/pve/priv/authorized_keys' file.
>
>
> +[[pvecm_quorum]]
> Quorum
> ------
>
> @@ -963,6 +964,7 @@ case $- in
> esac
> ----
>
> +[[pvecm_external_vote]]
> Corosync External Vote Support
> ------------------------------
>
Nit: above could/should be its own preparatory patch
> diff --git a/pvesr.adoc b/pvesr.adoc
> index 9ad02f5..034e4c2 100644
> --- a/pvesr.adoc
> +++ b/pvesr.adoc
> @@ -24,48 +24,68 @@ Storage Replication
> :pve-toplevel:
> endif::manvolnum[]
>
> -The `pvesr` command-line tool manages the {PVE} storage replication
> -framework. Storage replication brings redundancy for guests using
> -local storage and reduces migration time.
> -
> -It replicates guest volumes to another node so that all data is available
> -without using shared storage. Replication uses snapshots to minimize traffic
> -sent over the network. Therefore, new data is sent only incrementally after
> -the initial full sync. In the case of a node failure, your guest data is
> -still available on the replicated node.
> -
> -The replication is done automatically in configurable intervals.
> -The minimum replication interval is one minute, and the maximal interval
> -once a week. The format used to specify those intervals is a subset of
> -`systemd` calendar events, see
> -xref:pvesr_schedule_time_format[Schedule Format] section:
> -
> -It is possible to replicate a guest to multiple target nodes,
> -but not twice to the same target node.
> -
> -Each replications bandwidth can be limited, to avoid overloading a storage
> -or server.
> -
> -Only changes since the last replication (so-called `deltas`) need to be
> -transferred if the guest is migrated to a node to which it already is
> -replicated. This reduces the time needed significantly. The replication
> -direction automatically switches if you migrate a guest to the replication
> -target node.
> -
> -For example: VM100 is currently on `nodeA` and gets replicated to `nodeB`.
> -You migrate it to `nodeB`, so now it gets automatically replicated back from
> -`nodeB` to `nodeA`.
> -
> -If you migrate to a node where the guest is not replicated, the whole disk
> -data must send over. After the migration, the replication job continues to
> -replicate this guest to the configured nodes.
I'd add some heading like "Introduction" or "Overview" for the initial
section.
> +Replication can be configured for a guest which has volumes placed on
> +a local storage. Those volumes are then replicated to other cluster
Nit: I'd shorten this to "for a guest with volumes on a local storage"
Would add an early sentence that only ZFS is supported right now to not
raise false expectations.
> +nodes to enable a significantly faster guest migration to them.
Sentence can be split: "other cluster nodes. This enables"
Or even better, use the chance to define what a replication target is:
"other cluster nodes, the replication targets. This enables"
Would clarify "migration to a replication target."
> +Possible additional volumes on a shared storage are not being
I'd drop the "possible additional" and the "being"
> +replicated, since it is expected that the shared storage is also
> +available at the migration target node. Replication is particularly
Can replication still be configured if a shared storage is not available
at the replication target? That would be a bug. If it is still possible,
there should be a warning in the docs that it should not be configured.
Otherwise, a sentence that it cannot be configured.
> +interesting for small clusters if no shared storage is available.
> +
> +In the event of a node or local storage failure, the volume data as of
> +the latest completed replication runs are still available on the
> +replication target nodes.
>
> [IMPORTANT]
> ====
> -High-Availability is allowed in combination with storage replication, but there
> -may be some data loss between the last synced time and the time a node failed.
> +While a replication-enabled guest can be configured for
> +xref:chapter_ha_manager[high availability], or
> +xref:pvesr_node_failed[manually moved] while its origin node is not
> +available, read about the involved
The second clause doesn't fit the first one: "while ..., read about ..."
is not what you intend to say here. You could use "while ..., there is
risk of data loss. See/Read about ..."
> +xref:pvesr_risk_of_data_loss[risk of data loss] and how to avoid it.
> ====
>
> +.Replication requires …
IMHO, the ellipsis in the title looks a bit off. Do we use that anywhere
else in the documentation? I'd simply use "Requirements" (or
"Requirements:")
> +
> +* at least one other cluster node as a replication target
> +* one common local storage entry in the datacenter, being functional
"storage entry in the datacenter" might be open for interpretation,
maybe: "one common local storage definition in the dacaneter's storage
configuration" ?
> +on both nodes
> +* that the local storage type is
> +xref:pvesr_supported_storage[supported by replication]
> +* that the guest has volumes stored on that local storage
If dropping the ellipsis, also drop the "that"
> +
> +.Replication …
Same here. Although it's more difficult to come up with a good title,
maybe "Replication Facts"
> +
> +* allows a fast migration to nodes where the guest is being replicated
> +* provides guest volume redundancy in a cluster where using a shared
> +storage type is not an option
> +* is configured as a job for a guest, with multiple jobs enabling
> +multiple replication targets
> +* jobs run one after the other at their configured interval (shortest
s/interval/schedule/
> +is every minute)
> +* uses snapshots to regularly transmit only changed volume data
> +(so-called deltas)
> +* network bandwidth can be limited per job, smoothing the storage and
> +network utilization
Maybe "IO pressure" instead of "storage utilization", because the latter
usually refers to space.
> +* targets stay basically the same when migrating the guest to another
> +node
Not sure what this is supposed to mean. If this refers to the
replication targets, not sure if this is worth pointing out, especially
since the next point already mentions the interesting case.
> +* direction of a job reverses when moving the guest to its configured
> +replication target
> +
> +.Example:
> +
> +A guest runs on node `A` and has replication jobs to node `B` and `C`,
> +both with a set interval of every five minutes (`*/5`). Now we migrate
> +the guest from `A` to `B`, which also automatically updates the
> +replication targets for this guest to be `A` and `C`. Migration was
> +completed fast, as only the changed volume data since the last
> +replication run has been transmitted.
> +
> +In the event that node `B` or its local storage fails, the guest can
> +be restarted on `A` or `C`, with the risk of some data loss as
> +described in this chapter.
Should reference the relevant section.
> +
> +[[pvesr_supported_storage]]
> Supported Storage Types
> -----------------------
>
> @@ -76,147 +96,286 @@ Supported Storage Types
> |ZFS (local) |zfspool |yes |yes
> |=============================================
>
> -[[pvesr_schedule_time_format]]
> -Schedule Format
> ----------------
> -Replication uses xref:chapter_calendar_events[calendar events] for
> -configuring the schedule.
> -
> -Error Handling
> +[[pvesr_considerations]]
> +Considerations
> --------------
>
> -If a replication job encounters problems, it is placed in an error state.
> -In this state, the configured replication intervals get suspended
> -temporarily. The failed replication is repeatedly tried again in a
> -30 minute interval.
> -Once this succeeds, the original schedule gets activated again.
> +[[pvesr_risk_of_data_loss]]
> +Risk of Data Loss
> +~~~~~~~~~~~~~~~~~
> +
> +If a node should suddenly become unavailable for a longer period of
> +time, it may become neccessary to run a guest on a replication target
> +node instead. Thereby the guest will use the latest replicated volume
> +data available on the chosen target node. That volume state will then
> +also be replicated to other nodes with the next replication runs,
> +since the replication directions are automatically updated for related
> +jobs. This also means, that the once newer volume state on the failed
Nit: how about "volume state from the time of the failure" instead of
"once newer volume state"?
> +node will be removed after it becomes available again. Possible
"overwritten" instead of "removed"
Drop the "Possible".
> +volumes on a shared storage are not affected by that, since they are
> +not being replicated.
> +
> +A more resilient solution may be to use a shared
> +xref:chapter_storage[storage type] instead. If that is not an option,
> +consider setting the replication job intervals short enough and avoid
> +moving replication-configured guests while their origin node is not
s/replication-configured guests/replicated guests/
> +available. Instead of configuring those guests for high availability,
> +xref:qm_startup_and_shutdown[start at boot] could be a sufficient
> +alternative.
Since we're here already, let's add a sentence that neither replication
nor shared storages replace the need for taking regular backups :)
> +
> +[[pvesr_replication_network]]
> +Network for Replication Traffic
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Replication traffic is routed via the
> +xref:pvecm_migration_network[migration network]. If it is not set, the
> +management network is used by default, which can have a negative
> +impact on corosync and therefore on cluster availability. To specify
> +the migration network, navigate to
> +__Datacenter -> Options -> Migration Settings__, or set it via CLI in
> +the xref:datacenter_configuration_file[`datacenter.cfg`].
> +
> +[[pvesr_cluster_size]]
> +Cluster Size
> +~~~~~~~~~~~~
> +
> +With a 2-node cluster in particular, the failure of one node can leave
> +the other node without a xref:pvecm_quorum[quorum]. In order to keep
> +the cluster functional at all times, it is therefore crucial to
> +xref:pvecm_join_node_to_cluster[expand] to a 3-node cluster in advance
> +or to configure a xref:pvecm_external_vote[QDevice] for the third
> +vote.
> +
> +[[pvesr_managing_jobs]]
> +Managing Jobs
> +-------------
>
> -Possible issues
> -~~~~~~~~~~~~~~~
> +[thumbnail="screenshot/gui-qemu-add-replication-job.png"]
>
> -Some of the most common issues are in the following list. Depending on your
> -setup there may be another cause.
> +Replication jobs can easily be created, modified and removed via web
"via the web interface"
> +interface, or by using the CLI tool `pvesr`.
"the pvesr CLI tool" sounds more natural IMHO
>
> -* Network is not working.
> +To manage all replication jobs in one place, go to
> +__Datacenter -> Replication__. Additional functionalities are
> +available under __Node -> Replication__ and __Guest -> Replication__.
> +Go there to view logs, schedule a job once for now, or benefit from
> +preset fields when configuring a job.
>
> -* No free space left on the replication target storage.
> +Enabled replication jobs will automatically run at their set interval,
s/interval/schedule/
> +one after the other. You can change the default interval of every 15
> +minutes (`*/15`) by selecting or adapting an example from the
> +drop-down list. The shortest interval is every minute (`*/1`). See
> +also xref:chapter_calendar_events[schedule format].
>
> -* Storage with the same storage ID is not available on the target node.
> +If replication jobs result in significant I/O load on the target node,
> +the network bandwidth of individual jobs can be limited to keep the
> +load at an acceptable level.
>
> -NOTE: You can always use the replication log to find out what is causing the problem.
> +Shortly after job creation, a first snapshot is taken and sent to the
> +target node. Subsequent snapshots are taken according to the schedule
> +and only contain modified volume data, allowing a significantly
> +shorter transfer time.
>
> -Migrating a guest in case of Error
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> -// FIXME: move this to better fitting chapter (sysadmin ?) and only link to
> -// it here
> +If you remove a replication job, the snapshots on the target node are
> +also getting deleted again by default. The removal takes place at the
I'd drop the "again" (maybe also the "getting")
> +next possible point in time and requires the job to be enabled. If the
> +target node is permanently unreachable, the cleanup can be skipped by
> +forcing a job deletion via CLI.
>
> -In the case of a grave error, a virtual guest may get stuck on a failed
> -node. You then need to move it manually to a working node again.
> +When not using the web interface, the cluster-wide unique replication
I'd turn this into a positive:
"For managing jobs via CLI or API, the cluster-wide ..."
> +job ID has to be specified. For example, `100-0`, which is composed of
> +the guest ID, a hyphen and an arbitrary job number.
>
> -Example
> -~~~~~~~
> +[[pvesr_cli_examples]]
> +CLI Examples
> +------------
>
> -Let's assume that you have two guests (VM 100 and CT 200) running on node A
> -and replicate to node B.
> -Node A failed and can not get back online. Now you have to migrate the guest
> -to Node B manually.
> +Create a replication job for guest `100` and give it the job number
> +`0`. Replicate to node `pve2` every five minutes (`*/5`), at a maximum
> +network bandwitdh of `10` MBps (megabytes per second).
>
> -- connect to node B over ssh or open its shell via the web UI
> +----
> +# pvesr create-local-job 100-0 pve2 --schedule "*/5" --rate 10
> +----
> +
> +List replication jobs from all nodes.
>
> -- check if that the cluster is quorate
> -+
> ----
> -# pvecm status
> +# pvesr list
> ----
>
> -- If you have no quorum, we strongly advise to fix this first and make the
> - node operable again. Only if this is not possible at the moment, you may
> - use the following command to enforce quorum on the current node:
> -+
> +List the job statuses from all local guests, or only from a specific
> +local guest.
> +
> ----
> -# pvecm expected 1
> +# pvesr status [--guest 100]
> ----
>
> -WARNING: Avoid changes which affect the cluster if `expected votes` are set
> -(for example adding/removing nodes, storages, virtual guests) at all costs.
> -Only use it to get vital guests up and running again or to resolve the quorum
> -issue itself.
> +Read the configuration of job `100-0`.
s/Read/Show/
>
> -- move both guest configuration files form the origin node A to node B:
> -+
> ----
> -# mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
> -# mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
> +# pvesr read 100-0
> ----
>
> -- Now you can start the guests again:
> -+
> +Update the configuration of job `100-0`, for example, to change the
> +schedule interval to every full hour (`hourly`).
> +
> ----
> -# qm start 100
> -# pct start 200
> +# pvesr update 100-0 --schedule "*:00"
> ----
>
> -Remember to replace the VMIDs and node names with your respective values.
> +To run the job `100-0` once soon, schedule it regardless of the
> +configured interval.
s/interval/schedule/
>
> -Managing Jobs
> --------------
> +----
> +# pvesr schedule-now 100-0
> +----
>
> -[thumbnail="screenshot/gui-qemu-add-replication-job.png"]
> +Disable (or `enable`) the job `100-0`.
> +
> +----
> +# pvesr disable 100-0
I'd list the enable command here too, for completeness (and then use the
plain word "enable" without backticks above).
> +----
> +
> +Delete the job `100-0`. If the target node is permanently unreachable,
> +`--force` can be used to skip the failing cleanup.
>
> -You can use the web GUI to create, modify, and remove replication jobs
> -easily. Additionally, the command-line interface (CLI) tool `pvesr` can be
> -used to do this.
> +----
> +# pvesr delete 100-0 [--force]
> +----
>
> -You can find the replication panel on all levels (datacenter, node, virtual
> -guest) in the web GUI. They differ in which jobs get shown:
> -all, node- or guest-specific jobs.
> +[[pvesr_error_handling]]
> +Error Handling
> +--------------
>
> -When adding a new job, you need to specify the guest if not already selected
> -as well as the target node. The replication
> -xref:pvesr_schedule_time_format[schedule] can be set if the default of `all
> -15 minutes` is not desired. You may impose a rate-limit on a replication
> -job. The rate limit can help to keep the load on the storage acceptable.
> +[[pvesr_job_failed]]
> +Job Failed
> +~~~~~~~~~~
>
> -A replication job is identified by a cluster-wide unique ID. This ID is
> -composed of the VMID in addition to a job number.
> -This ID must only be specified manually if the CLI tool is used.
> +In the event that a replication job fails, it is temporarily placed in
> +an error state and a notification is sent. A retry is scheduled for 5
> +minutes later, followed by another 10, 15 and finally every 30
> +minutes. As soon as the job has run successfully again, the error
> +state is left and the configured interval is resumed.
s/left/cleared/
s/interval/schedule/
>
> -Network
> --------
> +.Troubleshooting Job Failures
>
> -Replication traffic will use the same network as the live guest migration. By
> -default, this is the management network. To use a different network for the
> -migration, configure the `Migration Network` in the web interface under
> -`Datacenter -> Options -> Migration Settings` or in the `datacenter.cfg`. See
> -xref:pvecm_migration_network[Migration Network] for more details.
> +To find out why a job exactly failed, read the log available under
> +__Node -> Replication__.
>
> -Command-line Interface Examples
> --------------------------------
> +Common causes are:
>
> -Create a replication job which runs every 5 minutes with a limited bandwidth
> -of 10 Mbps (megabytes per second) for the guest with ID 100.
> +* The network is not working properly.
> +* The storage (ID) in use has set an availability restriction,
> +excluding the target node.
> +* The storage is not set up correctly on the target node (e.g.
> +different pool name).
> +* The storage on the target node has no free space left.
> +
> +[[pvesr_node_failed]]
> +Origin Node Failed
> +~~~~~~~~~~~~~~~~~~
> +// FIXME: move this to better fitting chapter (sysadmin ?) and only link to
> +// it here
>
> +In the event that a node running replicated guests fails suddenly and
> +for too long, it may become necessary to restart these guests on their
> +replicated nodes. If replicated guests are configured for high
"on their replicated nodes" -> "on one of their replication targets"
> +availability (HA), beside its involved
> +xref:pvesr_risk_of_data_loss[risk of data loss], just wait until these
> +guests are recovered on other nodes. Replicated guests which are not
> +configured for HA can be moved manually as explained below, including
> +the same risk of data loss.
> +
> +[[pvesr_find_latest_replicas]]
> +.Step 1: Optionally Decide on a Specific Replication Target Node
> +
> +To minimize the data loss of an important guest, you can optionally
> +find the target node on which the most recent successful replication
> +took place. If the origin node is healthy enough to access its web
> +interface, go to __Node -> Replication__ and see the 'Last Sync'
> +column. Alternatively, you can carry out the following steps.
> +
> +. To list all target nodes of an important guest, exemplary with the
> +ID `1000`, go to the CLI of any node and run:
> ++
> ----
> -# pvesr create-local-job 100-0 pve1 --schedule "*/5" --rate 10
> +# pvesr list | grep -e Job -e ^1000
The regex should be terminated so it does not also match other IDs starting with 1000 (e.g. 10001)
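Untested, but something like

# pvesr list | grep -e Job -e '^1000-'

should avoid picking up such partial matches.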
> ----
>
> -Disable an active job with ID `100-0`.
> +. Open the CLI on all listed target nodes.
It's more convenient to use SSH
>
> +. Adapt the following command with your VMID to find the most recent
> +snapshots among your target nodes. If snapshots were taken in the same
> +minute, look for the highest number at the end of the name, which is
> +the Unix timestamp.
> ++
> ----
> -# pvesr disable 100-0
> +# zfs list -t snapshot -o name,creation | grep -e -1000-disk
If we provide such commands, then please let the machine do the sorting.
You can use -S/-s for the list command. And truncate the output with
head to not show all lines.
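Untested sketch of what I mean, assuming VMID 1000:

# zfs list -t snapshot -o name,creation -S creation | grep -e '-1000-disk' | head -n 3

That way the most recent snapshots are listed first and only a few lines are shown.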
> ----
>
> -Enable a deactivated job with ID `100-0`.
> +[[pvesr_verify_cluster_health]]
> +.Step 2: Verify Cluster Health
> +
> +Go to the CLI of any replication target node and run `pvecm status`.
> +If the output contains `Quorate: Yes`, then the cluster/corosync is
> +healthy enough and you can proceed with
> +xref:pvesr_move_a_guest[Step 3: Move a guest].
>
> +WARNING: If the cluster is not quorate and consists of 3 or more
> +nodes/votes, we strongly recommend to solve the underlying problem
> +first so that at least the majority of nodes/votes are available
> +again.
> +
> +If the cluster is not quorate and consists of only 2 nodes without an
> +additional xref:pvecm_external_vote[QDevice], you may want to proceed
> +with the following steps to temporary make the cluster functional
> +again.
Not sure if we even want to keep suggesting this. And if we really do,
I'd not say "you may want to proceed", but right away clarify that it's
not recommended and just a last-resort escape hatch.
> +
> +. Check whether the expected votes are `2`.
> ++
> ----
> -# pvesr enable 100-0
> +# pvecm status | grep votes
> ----
>
> -Change the schedule interval of the job with ID `100-0` to once per hour.
> +. Now you can enforce quorum on the one remaining node by running:
> ++
> +----
> +# pvecm expected 1
> +----
> ++
> +WARNING: Avoid making changes to the cluster in this state at all
> +costs, for example adding or removing nodes, storages or guests. Delay
> +it until the second node is available again and expected votes have
> +been automatically restored to `2`.
> +
> +[[pvesr_move_a_guest]]
> +.Step 3: Move a Guest
>
> +. Use SSH to connect to any node that is part of the cluster majority.
"quorate majority"
> +Alternatively, go to the web interface and open the shell of such node
> +in a separate window or browser tab.
> ++
> +. The following example commands move a VMID `1000` and CTID `2000`
> +from the node named `pve-failed` to a still available replication
> +target node named `pve-replicated`.
> ++
> +----
> +# cd /etc/pve/nodes/
> +# mv pve-failed/qemu-server/1000.conf pve-replicated/qemu-server/
> +# mv pve-failed/lxc/2000.conf pve-replicated/lxc/
> +----
> ++
> +. Now you can start those guests again:
> ++
> ----
> -# pvesr update 100-0 --schedule '*/00'
> +# qm start 1000
> +# pct start 2000
> ----
> ++
> +. If it was necessary to enforce the quorum, as described when
> +verifying the cluster health, do not forget the warning at the end
> +about avoiding changes to the cluster.
>
> ifdef::manvolnum[]
> include::pve-copyright.adoc[]