[pbs-devel] [PATCH proxmox{, -backup} v9 00/49] fix #2943: S3 storage backend for datastores
Lukas Wagner
l.wagner at proxmox.com
Mon Jul 21 17:05:10 CEST 2025
Retested these patches on the latest master branch(es): basic backups,
sync jobs, verification, GC, pruning, etc.
This time I tried to focus more on different failure scenarios, e.g. a
failing connection to the S3 server during different operations.
Here's what I found; most of these issues I have already discussed and
debugged off-list with @Chris:
1.)
When doing an S3 Refresh while PBS cannot connect to S3, a `tmp_xxxxxxx`
directory is left over in the local datastore directory. After clearing
S3 Refresh maintenance mode (or doing a successful S3 refresh), GC jobs
will fail because they cannot access this left-over directory (it is
owned by root:root).
AFAIK Chris has already prepared a fix for this.
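
For reference, here is a minimal sketch of how such a temporary directory
could be tied to a cleanup guard so it cannot outlive a failed refresh
(hypothetical names, not the actual PBS code; it also ignores the
ownership aspect):

use std::path::{Path, PathBuf};

struct TmpDirGuard {
    path: PathBuf,
    disarmed: bool,
}

impl Drop for TmpDirGuard {
    fn drop(&mut self) {
        if !self.disarmed {
            // best-effort cleanup, we are already on an error path
            let _ = std::fs::remove_dir_all(&self.path);
        }
    }
}

fn s3_refresh(store_dir: &Path) -> std::io::Result<()> {
    let tmp_path = store_dir.join(format!("tmp_{}", std::process::id()));
    std::fs::create_dir(&tmp_path)?;
    let mut guard = TmpDirGuard { path: tmp_path.clone(), disarmed: false };

    fetch_contents_from_s3(&tmp_path)?; // may bail out if S3 is unreachable

    // only keep the directory once the refresh actually succeeded
    guard.disarmed = true;
    Ok(())
}

fn fetch_contents_from_s3(_tmp: &Path) -> std::io::Result<()> {
    Ok(()) // placeholder for the actual download logic
}
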
2.)
I backed up some VMs to my local MinIO server, which ran out of disk
space during the backup. Since even delete operations failed in this
scenario, PBS could not clean up the snapshot directory left over by
the failed backup. In some instances the snapshot directory was
completely empty; in another case two blobs were written, but the fidx
files were missing:
root at pbs-s3:/s3-store/ns/pali/vm# ls 160/2025-07-21T12\:51\:44Z/
fw.conf.blob qemu-server.conf.blob
root at pbs-s3:/s3-store/ns/pali/vm# ls 165/
2025-07-21T12:52:42Z/ owner
root at pbs-s3:/s3-store/ns/pali/vm# ls 165/2025-07-21T12\:52\:42Z/
root at pbs-s3:/s3-store/ns/pali/vm#
I could fix this by doing an "S3 Refresh" and then manually deleting the
affected snapshot in the "Content" view - something that could become
very annoying with hundreds or thousands of snapshots, so I think we
need some form of automatic cleanup for fragments from incomplete/failed
backups. After all, I'm pretty sure one could end up in a similar
situation just by cutting the network connection to the S3 server at the
right moment.
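
As a rough idea of what such a cleanup could look like, a sketch with
made-up function names (not the actual datastore code), assuming the
manifest index.json.blob is only present for completed backups and that
snapshots still being written are excluded, e.g. via locking:

use std::path::Path;

fn cleanup_incomplete_snapshots(group_dir: &Path) -> std::io::Result<()> {
    for entry in std::fs::read_dir(group_dir)? {
        let snapshot_dir = entry?.path();
        if !snapshot_dir.is_dir() {
            continue; // e.g. the `owner` file
        }
        // a completed backup ends with the manifest being written, so a
        // snapshot directory without one is treated as a leftover fragment
        if !snapshot_dir.join("index.json.blob").exists() {
            eprintln!("removing incomplete snapshot {:?}", snapshot_dir);
            std::fs::remove_dir_all(&snapshot_dir)?;
        }
    }
    Ok(())
}
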
3.)
I cut the connection to my MinIO server during a verification job.
The task log was spammed with the following messages:
2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0
2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/7478/747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0.0.bad"
2025-07-21T16:06:51+02:00: "can't verify chunk, load failed - client error (Connect)"
2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf
2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/5680/5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf.0.bad"
While not really catastrophic - these chunks would simply be re-fetched
from S3 on the next access - this should probably be handled more
gracefully.
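
What I would expect here, roughly sketched with hypothetical types (not
the actual verify worker), is that a transport error is not treated as
chunk corruption in the first place:

enum ChunkLoadError {
    Transport(String), // e.g. "client error (Connect)"
    Corrupt(String),   // digest mismatch, failed decode, ...
}

fn verify_chunk(
    digest: &str,
    load: impl Fn(&str) -> Result<Vec<u8>, ChunkLoadError>,
) -> Result<(), String> {
    match load(digest) {
        Ok(_data) => Ok(()),
        Err(ChunkLoadError::Corrupt(msg)) => {
            // only in this case renaming the local copy to *.bad (and
            // re-fetching a healthy copy from S3) is justified
            Err(format!("chunk {digest} is corrupt: {msg}"))
        }
        Err(ChunkLoadError::Transport(msg)) => {
            // says nothing about the chunk itself, so don't mark it as
            // bad; abort or retry the verification instead
            Err(format!("backend unreachable, aborting: {msg}"))
        }
    }
}
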
One thing that I spotted in the documentation was the following:
proxmox-backup-manager s3 client create my-s3-client --secrets-id my-s3-client ...
The user has to specify the client ID twice, once for the regular config
and once for the secrets config. This was implemented this way due to how
parameter flattening for API type structs works. I discussed this with
@Chris and suggested another approach that works without duplicating the
ID, to hopefully make the UX a bit nicer.
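
To illustrate where the duplication comes from (made-up structs, not the
real API types, and only an approximation of how the flattening behaves):
both the config entry and the secrets entry carry their own ID field, so
flattening them into a single set of create parameters exposes both IDs
to the caller:

use serde::Deserialize;

#[derive(Deserialize)]
struct S3ClientConfig {
    id: String,
    endpoint: String,
}

#[derive(Deserialize)]
struct S3ClientSecrets {
    #[serde(rename = "secrets-id")]
    secrets_id: String,
    secret_key: String,
}

// flattening both structs surfaces `id` and `secrets-id` side by side,
// forcing the caller to pass the same value twice
#[derive(Deserialize)]
struct CreateParams {
    #[serde(flatten)]
    config: S3ClientConfig,
    #[serde(flatten)]
    secrets: S3ClientSecrets,
}
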
Apart from these issues everything seemed to work fine.
Tested-by: Lukas Wagner <l.wagner at proxmox.com>