[pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow

Fiona Ebner f.ebner at proxmox.com
Mon Jun 10 14:59:35 CEST 2024


A long-standing issue with VM backups in Proxmox VE is that a slow or
unreachable target causes a copy-before-write (cbw) operation to break
the guest write rather than abort the backup. This is unexpected for
users, and they end up without a successful backup and without a
working guest in such cases. This series prevents the latter by
changing the behavior to fail the backup instead of the guest write.

This is done by re-using the existing 'on-cbw-error' and 'cbw-timeout'
options that are already used for fleecing and having regular backup
also check for the cbw's snapshot_error (unfortunately, that name
becomes a bit of a misnomer). If a given copy-before-write operation
cannot complete within 45 seconds, it's extremely likely that aborting
the backup is the better choice than keeping the guest IO blocked.
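
To illustrate the intended behavior, here is a minimal standalone
model, not the code from the patches. CbwModel, cbw_model_guest_write()
and backup_model_result() are hypothetical names made up for this
sketch; only the 45 second timeout and the snapshot_error idea are
taken from the description above. A copy-before-write operation that
exceeds the timeout marks the snapshot as broken, the guest write still
succeeds, and the backup later sees the error.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the cbw node state; not QEMU's real struct. */
typedef struct CbwModel {
    int64_t timeout_s;   /* models the 'cbw-timeout' option, in seconds */
    int snapshot_error;  /* 0 while the backup's view is still intact */
} CbwModel;

/*
 * Models one guest write whose copy-before-write step would take
 * copy_s seconds. The return value is what the guest sees. It is
 * always 0 here, which models breaking the snapshot rather than the
 * guest write.
 */
static int cbw_model_guest_write(CbwModel *s, int64_t copy_s)
{
    if (s->snapshot_error) {
        /* Snapshot already broken: old data no longer needs copying. */
        return 0;
    }
    if (copy_s > s->timeout_s) {
        /* Target too slow or unreachable: give up on the snapshot. */
        s->snapshot_error = -ETIMEDOUT;
    }
    return 0;
}

/* Models the backup job checking the cbw node's error state. */
static int backup_model_result(const CbwModel *s)
{
    return s->snapshot_error;
}

/* Small usage example for the model above. */
static void cbw_model_demo(void)
{
    CbwModel s = { .timeout_s = 45, .snapshot_error = 0 };

    assert(cbw_model_guest_write(&s, 1) == 0);   /* fast target */
    assert(backup_model_result(&s) == 0);

    assert(cbw_model_guest_write(&s, 60) == 0);  /* guest write still OK */
    assert(backup_model_result(&s) == -ETIMEDOUT); /* backup fails */
}
```

In this model the backup only learns about the failure when it checks
the result, which matches the "can only check the error at the end"
limitation described next.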

Just checking for the error already makes it work (i.e. without the
last two patches), but backup can only check the error at the end. To
abort backup immediately, an error callback for the copy-before-write
node is introduced. A potential alternative would be to give the
block-copy operation a pointer to the snapshot_error and have it check
it during its operation, but my initial attempt failed. Likely I
missed adapting certain logic that checks whether the block-copy
operation failed, and it's questionable if this approach would be
cleaner. An error callback is nice and explicit.
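
The callback idea can be sketched roughly as follows. This is a
standalone model under assumptions, not the interface added by the
patches: CbwErrorCb, CbwNodeModel, BackupJobModel and all function
names are hypothetical. The node records the first error and notifies
whoever registered a callback, so the job can cancel right away
instead of finding the error at the end.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: invoked once when the snapshot breaks. */
typedef void (*CbwErrorCb)(void *opaque, int error);

/* Hypothetical stand-in for the cbw node; not QEMU's real struct. */
typedef struct CbwNodeModel {
    int snapshot_error;
    CbwErrorCb error_cb;
    void *error_cb_opaque;
} CbwNodeModel;

/* Hypothetical stand-in for the backup job. */
typedef struct BackupJobModel {
    bool cancelled;
    int ret;
} BackupJobModel;

static void cbw_node_set_error_cb(CbwNodeModel *s, CbwErrorCb cb,
                                  void *opaque)
{
    s->error_cb = cb;
    s->error_cb_opaque = opaque;
}

/* Record the first error only and notify the registered callback. */
static void cbw_node_report_error(CbwNodeModel *s, int error)
{
    if (s->snapshot_error) {
        return;
    }
    s->snapshot_error = error;
    if (s->error_cb) {
        s->error_cb(s->error_cb_opaque, error);
    }
}

/* What the backup side might do in its callback: cancel immediately. */
static void backup_on_cbw_error(void *opaque, int error)
{
    BackupJobModel *job = opaque;

    job->cancelled = true;
    job->ret = error;
}

/* Small usage example for the model above. */
static void cbw_callback_demo(void)
{
    CbwNodeModel node = { 0 };
    BackupJobModel job = { 0 };

    cbw_node_set_error_cb(&node, backup_on_cbw_error, &job);
    cbw_node_report_error(&node, -ETIMEDOUT);

    assert(job.cancelled);
    assert(job.ret == -ETIMEDOUT);
}
```

Compared to handing block-copy a pointer to snapshot_error, the
callback keeps the error path in one explicit place in this model.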

Note for testers: if e.g. the PBS is completely unreachable, the
backup job will still need to wait until the in-flight request is
aborted after 15 minutes. But the guest writes should be fast again.

Should it really be required to make the option more flexible, i.e.
allow users to specify a custom timeout or go back to the old
behavior, then the 'backup' QMP call can be extended with those
parameters.

Unfortunately, it takes a non-trivial amount of code to make this
work, but quite a bit of it is boilerplate and comments, so hopefully
the logic is straightforward enough.


The first patch can be applied regardless of whether we want to go
with this or not.


Fiona Ebner (7):
  PVE backup: fleecing: properly set minimum cluster size
  block/copy-before-write: allow passing additional options for
    bdrv_cbw_append()
  block/backup: allow passing additional options for copy-before-write
    upon job creation
  block/backup: make cbw error also fail backup that does not use
    fleecing
  fix #3231+#3631: PVE backup: add timeout for copy-before-write
    operations and fail backup instead of guest writes
  block/copy-before-write: allow specifying error callback
  block/backup: set callback for cbw errors

 block/backup.c                         | 57 +++++++++++++++++++++++++-
 block/copy-before-write.c              | 41 +++++++++++++++---
 block/copy-before-write.h              |  9 +++-
 block/replication.c                    |  2 +-
 blockdev.c                             |  2 +-
 include/block/block_int-global-state.h |  2 +
 pve-backup.c                           | 13 +++++-
 7 files changed, 115 insertions(+), 11 deletions(-)

-- 
2.39.2




