[pbs-devel] [PATCH v4 proxmox-backup] fix #5710: api: backup: stat known chunks on backup finish

Mon Nov 25 22:42:27 CET 2024

On 08.10.24 at 11:46, Christian Ebner wrote:
> Known chunks are expected to be present on the datastore a priori,
> allowing clients to only re-index these chunks without uploading the
> raw chunk data. The server deduces the list of reusable known chunks
> from the chunks indexed by the previous backup snapshot of the group
> and sends it to the client.
> 
> If, however, such a known chunk has disappeared (because the previous
> backup snapshot was verified before the chunk went missing, or has not
> been verified yet), the backup still finishes without error, leading
> to a seemingly successful backup. Only a subsequent verification job
> will detect the backup snapshot as being corrupt.
> 
> To reduce the impact, stat the list of previously known chunks when
> finishing the backup. If a missing chunk is detected, the backup run
> itself fails and the verify state of the previous backup snapshot is
> set to failed.
> This prevents the same snapshot from being reused by another,
> subsequent backup job.
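For illustration, the finish-time check can be sketched in Rust as below. This is a minimal sketch under stated assumptions, not the code from the patch: the `ChunkOrigin` enum, the `stat_known_chunks`, `chunk_path` and `digest_to_hex` helpers, and the `<store>/.chunks/<first 4 hex digits>/<hex digest>` path layout are hypothetical here, and setting the previous snapshot's verify state to failed is left out.

```rust
use std::collections::HashMap;
use std::fmt::Write as _;
use std::path::{Path, PathBuf};

/// Hypothetical: where a chunk referenced by the new index came from.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ChunkOrigin {
    /// Uploaded by the client during this session (just written).
    Uploaded,
    /// Taken from the previous snapshot's index and only re-indexed.
    Known,
}

/// Hex-encode a 32-byte chunk digest.
fn digest_to_hex(digest: &[u8; 32]) -> String {
    let mut hex = String::with_capacity(64);
    for byte in digest {
        write!(hex, "{:02x}", byte).unwrap();
    }
    hex
}

/// Assumed layout: <store>/.chunks/<first 4 hex digits>/<full hex digest>.
fn chunk_path(store: &Path, digest: &[u8; 32]) -> PathBuf {
    let hex = digest_to_hex(digest);
    store.join(".chunks").join(&hex[..4]).join(&hex)
}

/// Stat only the previously known chunks; fail on the first missing one.
fn stat_known_chunks(
    store: &Path,
    chunks: &HashMap<[u8; 32], ChunkOrigin>,
) -> Result<(), String> {
    for (digest, origin) in chunks {
        if *origin != ChunkOrigin::Known {
            // Freshly uploaded chunks were written in this session, skip them.
            continue;
        }
        let path = chunk_path(store, digest);
        std::fs::metadata(&path).map_err(|err| {
            format!("known chunk {} missing: {}", digest_to_hex(digest), err)
        })?;
    }
    Ok(())
}
```

Only chunks tagged `Known` are stat'ed, which mirrors the version 2 change of distinguishing newly uploaded from previously known chunks so that only the latter need checking.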
> 
> Note:
> The current backup run might have been fine if the now missing known
> chunk is not actually indexed by it. But since there is no
> straightforward way to detect which known chunks have not been reused
> in the fast incremental mode for fixed index backups, the backup run
> is considered failed.
> 
> link to issue in bugtracker:
> https://bugzilla.proxmox.com/show_bug.cgi?id=5710
> 
> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
> Tested-by: Gabriel Goller <g.goller at proxmox.com>
> Reviewed-by: Gabriel Goller <g.goller at proxmox.com>
> ---
> Changes since version 3, thanks to Gabriel for additional comments:
> - Use anyhow error context also for manifest update error
> - Use `with_context` over mapping the error, which is more concise
> 
> Changes since version 2, thanks to Gabriel for testing and review:
> - Use and display anyhow error context
> - s/backp/backup/
> 
> Changes since version 1, thanks to Dietmar and Gabriel for feedback:
> - Only stat on backup finish
> - Distinguish newly uploaded from previously known chunks, to be able
>   to only stat the latter.
> 
> New tests on my side show a performance degradation of ~2% for the VM
> backup and ~10% for the LXC backup compared to an unpatched server.
> In contrast to version 1 of the patches, the PBS datastore was this
> time located on an NFS share backed by an NVMe SSD.
> 
> I performed vzdump backups of a VM with a 32G disk attached and of an
> LXC container with a Debian install and a rootfs of ca. 400M (both
> powered off, no data changes in between backup runs).
> Again, I performed 5 runs each after an initial run, to ensure that all
> chunks were present on the server and a valid previous snapshot existed.
> 
> Here are the updated figures:
> 
> -----------------------------------------------------------
> patched                    | unpatched
> -----------------------------------------------------------
> VM           | LXC         | VM           | LXC
> -----------------------------------------------------------
> 14.0s ± 0.8s | 2.2s ± 0.1s | 13.7s ± 0.5s | 2.0s ± 0.03s
> -----------------------------------------------------------
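The ~2% and ~10% figures follow from the table as (patched - unpatched) / unpatched. A minimal Rust helper to reproduce them (the `overhead_percent` name is hypothetical):

```rust
/// Relative slowdown of the patched server versus the unpatched one, in percent.
fn overhead_percent(patched_secs: f64, unpatched_secs: f64) -> f64 {
    (patched_secs - unpatched_secs) / unpatched_secs * 100.0
}
```

With the means above, `overhead_percent(14.0, 13.7)` is about 2.2 and `overhead_percent(2.2, 2.0)` is 10.0. For the VM case the difference of 0.3s is smaller than the reported spreads (± 0.8s and ± 0.5s), so that figure is within the run-to-run variation.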

Please include this in the actual commit message; it's nice to have as a
point-in-time sample when reading the git log.
A comparison with bigger disks, say 1 TB, would additionally be great, to
see how this scales with disk size.