[pbs-devel] [PATCH proxmox{, -backup} v9 00/49] fix #2943: S3 storage backend for datastores
Christian Ebner
c.ebner at proxmox.com
Mon Jul 21 17:37:30 CEST 2025
On 7/21/25 5:05 PM, Lukas Wagner wrote:
> Retested these patches on the latest master branch(es).
>
> Retested basic backups, sync jobs, verification, GC, pruning, etc.
>
> This time I tried to focus more on different failure scenarios, e.g. a
> failing connection to the S3 server during different operations.
>
> Here's what I found; most of these issues I have already discussed and
> debugged off-list with @Chris:
>
> 1.)
>
> When doing an S3 Refresh and PBS cannot connect to S3, a `tmp_xxxxxxx`
> directory is left over in the local datastore directory. After clearing
> S3 Refresh maintenance mode (or doing a successful S3 refresh), GC jobs
> will fail because they cannot access this left-over directory (it is
> owned by root:root).
> AFAIK Chris has already prepared a fix for this.
Will be fixed in the next version of the patch series, thanks!
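
Until the fix lands, the leftover directory can be removed by hand,
roughly like this (untested sketch; the /s3-store path is taken from your
output, the datastore name "s3-store" is a guess, and the actual tmp_*
name will differ):

  # locate leftover temporary directories from the failed S3 refresh
  find /s3-store -maxdepth 1 -type d -name 'tmp_*' -user root
  # remove them after double-checking the output above
  rm -rf /s3-store/tmp_xxxxxxx
  # garbage collection should then run through again
  proxmox-backup-manager garbage-collection start s3-store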
>
> 2.)
>
> I backed up some VMs to my local MinIO server, which ran out of disk
> space during the backup. Since even delete operations failed in this
> scenario, PBS could not clean up the snapshot directory that was
> left over after the failed backup. In some instances the snapshot
> directory was completely empty; in another case two blobs were
> written, but the fidx files were missing:
>
> root@pbs-s3:/s3-store/ns/pali/vm# ls 160/2025-07-21T12\:51\:44Z/
> fw.conf.blob qemu-server.conf.blob
> root@pbs-s3:/s3-store/ns/pali/vm# ls 165/
> 2025-07-21T12:52:42Z/ owner
> root@pbs-s3:/s3-store/ns/pali/vm# ls 165/2025-07-21T12\:52\:42Z/
> root@pbs-s3:/s3-store/ns/pali/vm#
>
> I could fix this by doing an "S3 Refresh" and then manually deleting the
> affected snapshot under the "Content" view - something that could be
> very annoying if one has hundreds or thousands of snapshots, so I think
> we need some form of automatic cleanup for fragments from
> incomplete/failed backups. After all, I'm pretty sure that one could end
> up in a similar situation by just cutting the network connection to the
> S3 server at the right moment.
As already discussed a bit off-list, this would indeed be nice to have,
but at the moment I see no way of doing this consistently without manual
user interaction. In your tests, cleanup of objects from the S3 backend
failed because the server ran out of disk space, so the user needs to fix
that first anyway. Automatic cleanup of fragments on the S3 store after a
connection loss might be doable during garbage collection or
verification, but I will have to think this through in detail, so this is
best left for a follow-up.
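
Until then, removing such leftover snapshots via the CLI might be a bit
less tedious than the "Content" view, something along these lines
(untested sketch; repository host, datastore name and namespace are
guessed from your example output, adjust as needed):

  # list snapshots in the affected namespace
  proxmox-backup-client snapshot list \
    --repository root@pam@localhost:s3-store --ns pali
  # forget the incomplete snapshot from your example
  proxmox-backup-client snapshot forget vm/165/2025-07-21T12:52:42Z \
    --repository root@pam@localhost:s3-store --ns pali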
>
> 3.)
>
> Cut the connection to my MinIO server during a verification job.
> The task log was spammed by the following messages:
>
> 2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0
> 2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/7478/747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0.0.bad"
> 2025-07-21T16:06:51+02:00: "can't verify chunk, load failed - client error (Connect)"
> 2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf
> 2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/5680/5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf.0.bad"
>
> While not really catastrophic, since these chunks would then just be
> re-fetched from S3 on the next access, this should probably be handled
> more gracefully.
Already fixed this as well for the upcoming v10 of the patches, thanks!
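
I.e. something along the lines of only treating a chunk as corrupt (and
renaming it to ".bad") when it was actually read and failed validation,
not when the load itself failed with a transport error. A purely
illustrative sketch of that distinction, with made-up types and not the
actual verify code:

// Purely illustrative, made-up types - not the actual PBS verify code.
use std::io;

enum LoadError {
    // transient transport problem, e.g. "client error (Connect)"
    Transport(io::Error),
    // chunk was read successfully but failed digest validation
    Corrupt,
}

fn handle_chunk_result(result: Result<Vec<u8>, LoadError>) -> Result<(), String> {
    match result {
        Ok(_data) => Ok(()),
        Err(LoadError::Corrupt) => {
            // only in this case: rename the local chunk to "<digest>.0.bad"
            // and copy the chunk again from/to the S3 backend
            Err("chunk corrupt, renamed to .bad".to_string())
        }
        Err(LoadError::Transport(err)) => {
            // leave the local chunk untouched, just report the error
            // (and e.g. abort or retry the verification)
            Err(format!("chunk load failed, not marking as bad: {err}"))
        }
    }
}

fn main() {
    let connect_err =
        io::Error::new(io::ErrorKind::ConnectionRefused, "client error (Connect)");
    println!("{:?}", handle_chunk_result(Err(LoadError::Transport(connect_err))));
}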
>
> One thing that I spotted in the documentation was the following:
>
> proxmox-backup-manager s3 client create my-s3-client --secrets-id my-s3-client ...
>
> The user has to specify the client ID twice, once for the regular config
> and once for the secret config. This was implemented this way due to how
> parameter flattening for API type structs works. I discussed this
> with @Chris and suggested another approach that works without
> duplicating the ID, to hopefully make the UX a bit nicer.
Same, this will be fixed with the next iteration.
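
For the archives, roughly why the flattening forces the duplicated ID: the
create call takes the union of both flattened structs as its parameter
list, and each struct brings its own ID field. Hypothetical sketch using
plain serde, not the actual proxmox-schema API types:

// Hypothetical illustration, not the actual PBS config types.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct S3ClientConfig {
    id: String, // -> parameter for the regular config
    endpoint: String,
    access_key: String,
}

#[derive(Debug, Deserialize)]
struct S3ClientSecretsConfig {
    secrets_id: String, // -> separate parameter for the secrets config
    secret_key: String,
}

// Flattening both structs into the create-call parameters exposes
// "id" and "secrets_id" as two independent fields, hence the duplication.
#[derive(Debug, Deserialize)]
struct CreateS3ClientParams {
    #[serde(flatten)]
    config: S3ClientConfig,
    #[serde(flatten)]
    secrets: S3ClientSecretsConfig,
}

fn main() {
    let params: CreateS3ClientParams = serde_json::from_str(
        r#"{
            "id": "my-s3-client",
            "endpoint": "s3.example.com",
            "access_key": "ACCESSKEY",
            "secrets_id": "my-s3-client",
            "secret_key": "SECRETKEY"
        }"#,
    )
    .unwrap();
    println!("{params:?}");
}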
> Apart from these issues, everything seemed to work fine.
>
> Tested-by: Lukas Wagner <l.wagner at proxmox.com>