[pbs-devel] [PATCH proxmox{, -backup} v9 00/49] fix #2943: S3 storage backend for datastores
Christian Ebner
c.ebner at proxmox.com
Mon Jul 21 17:37:30 CEST 2025
On 7/21/25 5:05 PM, Lukas Wagner wrote:
> Retested these patches on the latest master branch(es).
>
> Retested basic backups, sync jobs, verification, GC, pruning, etc.
>
> This time I tried to focus more on different failure scenarios, e.g. a
> failing connection to the S3 server during different operations.
>
> Here's what I found; most of these issues I have already discussed and
> debugged off-list with @Chris:
>
> 1.)
>
> When doing an S3 Refresh and PBS cannot connect to S3, a `tmp_xxxxxxx`
> directory is left over in the local datastore directory. After clearing
> S3 Refresh maintenance mode (or doing a successful S3 refresh), GC jobs
> will fail because they cannot access this left-over directory (it is
> owned by root:root).
> AFAIK Chris has already prepared a fix for this.
Will be fixed in the next version of the patch series, thanks!
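
Until the fix lands, the leftover directory can be removed by hand,
roughly like this (untested sketch; the /s3-store path is taken from your
output, the datastore name "s3-store" is a guess, and the actual tmp_*
name will differ):

  # locate leftover temporary directories from the failed S3 refresh
  find /s3-store -maxdepth 1 -type d -name 'tmp_*' -user root
  # remove them after double-checking the output above
  rm -rf /s3-store/tmp_xxxxxxx
  # garbage collection should then run through again
  proxmox-backup-manager garbage-collection start s3-store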
>
> 2.)
>
> I backed up some VMs to my local MinIO server, which ran out of disk
> space during the backup. Since even delete operations failed in this
> scenario, PBS could not clean up the snapshot directory that was
> left over after the failed backup. In some instances the snapshot
> directory was completely empty; in another case two blobs were
> written, but the fidx files were missing:
>
> root@pbs-s3:/s3-store/ns/pali/vm# ls 160/2025-07-21T12\:51\:44Z/
> fw.conf.blob qemu-server.conf.blob
> root@pbs-s3:/s3-store/ns/pali/vm# ls 165/
> 2025-07-21T12:52:42Z/ owner
> root@pbs-s3:/s3-store/ns/pali/vm# ls 165/2025-07-21T12\:52\:42Z/
> root@pbs-s3:/s3-store/ns/pali/vm#
>
> I could fix this by doing an "S3 Refresh" and then manually deleting the
> affected snapshot under the "Content" view - something that could be
> very annoying if one has hundreds or thousands of snapshots, so I think
> we need some form of automatic cleanup for fragments from
> incomplete/failed backups. After all, I'm pretty sure that one could end
> up in a similar situation by just cutting the network connection to the
> S3 server at the right moment.
As already discussed a bit off-list, this would indeed be nice to have,
but at the moment I see no way of doing this consistently without manual
user interaction. In your tests, cleanup of objects from the S3 backend
failed because the server ran out of disk space, so the user needs to fix
that first anyway. Automatic cleanup of fragments on the S3 store after a
connection loss might be doable during garbage collection or
verification, but I will have to think this through in detail, so this is
best left for a follow-up.
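
Until then, removing such leftover snapshots via the CLI might be a bit
less tedious than the "Content" view, something along these lines
(untested sketch; repository host, datastore name and namespace are
guessed from your example output, adjust as needed):

  # list snapshots in the affected namespace
  proxmox-backup-client snapshot list \
    --repository root@pam@localhost:s3-store --ns pali
  # forget the incomplete snapshot from your example
  proxmox-backup-client snapshot forget vm/165/2025-07-21T12:52:42Z \
    --repository root@pam@localhost:s3-store --ns pali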
>
> 3.)
>
> Cut the connection to my MinIO server during a verification job.
> The task log was spammed by the following messages:
>
> 2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0
> 2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/7478/747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0.0.bad"
> 2025-07-21T16:06:51+02:00: "can't verify chunk, load failed - client error (Connect)"
> 2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf
> 2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/5680/5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf.0.bad"
>
> While not really catastrophic, since these chunks would then just be
> re-fetched from S3 on the next access, this should probably be handled
> more gracefully.
Already fixed this as well for the upcoming v10 of the patches, thanks!
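
I.e. something along the lines of only treating a chunk as corrupt (and
renaming it to ".bad") when it was actually read and failed validation,
not when the load itself failed with a transport error. A purely
illustrative sketch of that distinction, with made-up types and not the
actual verify code:

// Purely illustrative, made-up types - not the actual PBS verify code.
use std::io;

enum LoadError {
    // transient transport problem, e.g. "client error (Connect)"
    Transport(io::Error),
    // chunk was read successfully but failed digest validation
    Corrupt,
}

fn handle_chunk_result(result: Result<Vec<u8>, LoadError>) -> Result<(), String> {
    match result {
        Ok(_data) => Ok(()),
        Err(LoadError::Corrupt) => {
            // only in this case: rename the local chunk to "<digest>.0.bad"
            // and copy the chunk again from/to the S3 backend
            Err("chunk corrupt, renamed to .bad".to_string())
        }
        Err(LoadError::Transport(err)) => {
            // leave the local chunk untouched, just report the error
            // (and e.g. abort or retry the verification)
            Err(format!("chunk load failed, not marking as bad: {err}"))
        }
    }
}

fn main() {
    let connect_err =
        io::Error::new(io::ErrorKind::ConnectionRefused, "client error (Connect)");
    println!("{:?}", handle_chunk_result(Err(LoadError::Transport(connect_err))));
}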
>
> One thing that I spotted in the documentation was the following:
>
> proxmox-backup-manager s3 client create my-s3-client --secrets-id my-s3-client ...
>
> The user has to specify the client ID twice, once for the regular config
> and once for the secret config. This was implemented this way due to how
> parameter flattening for API type structs works. I discussed this
> with @Chris and suggested another approach that works without
> duplicating the ID, to hopefully make the UX a bit nicer.
Same, this will be fixed with the next iteration.
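
For the archives, roughly why the flattening forces the duplicated ID: the
create call takes the union of both flattened structs as its parameter
list, and each struct brings its own ID field. Hypothetical sketch using
plain serde, not the actual proxmox-schema API types:

// Hypothetical illustration, not the actual PBS config types.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct S3ClientConfig {
    id: String, // -> parameter for the regular config
    endpoint: String,
    access_key: String,
}

#[derive(Debug, Deserialize)]
struct S3ClientSecretsConfig {
    secrets_id: String, // -> separate parameter for the secrets config
    secret_key: String,
}

// Flattening both structs into the create-call parameters exposes
// "id" and "secrets_id" as two independent fields, hence the duplication.
#[derive(Debug, Deserialize)]
struct CreateS3ClientParams {
    #[serde(flatten)]
    config: S3ClientConfig,
    #[serde(flatten)]
    secrets: S3ClientSecretsConfig,
}

fn main() {
    let params: CreateS3ClientParams = serde_json::from_str(
        r#"{
            "id": "my-s3-client",
            "endpoint": "s3.example.com",
            "access_key": "ACCESSKEY",
            "secrets_id": "my-s3-client",
            "secret_key": "SECRETKEY"
        }"#,
    )
    .unwrap();
    println!("{params:?}");
}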
> Apart from these issues, everything seemed to work fine.
>
> Tested-by: Lukas Wagner <l.wagner at proxmox.com>