[pbs-devel] [PATCH proxmox-backup v2] fix #6566: backup: api: conditionally drop group and snapshot locks
Christian Ebner
c.ebner at proxmox.com
Tue Sep 30 11:21:11 CEST 2025
On 9/30/25 10:44 AM, Fabian Grünbichler wrote:
> On September 29, 2025 5:17 pm, Christian Ebner wrote:
>> To guarantee consistency in the presence of concurrent operations, the
>> backup protocol locks the backup group, the previous backup
>> snapshot (if any) and holds a lock for the newly created backup
>> snapshot. All of these locks are currently stored in the backup worker
>> task and only released on its destruction.
>>
>> The backup API however signals a successful backup via the return
>> status of the `finish` call, while still holding the locks.
>> Therefore, an immediate subsequent backup of the client to the same
>> group can fail because the locks cannot be acquired until the previous
>> backup task is completely destroyed, which can however outlive the
>> `finish` return for some time. This manifests in e.g. a push sync job
>> failing.
>>
>> To fix this, store the lock guards in the RPC environment's shared
>> state instead, allowing the locks to be selectively dropped on
>> successful backup finish. On error, hold the locks until cleanup has
>> completed.
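As a rough sketch of the idea (type and field names here are illustrative stand-ins, not the actual proxmox-backup types): the guards live as `Option`s in the shared state, so `finish` can drop them selectively while the worker task is still alive.

```rust
use std::sync::{Arc, Mutex};

// Placeholder for the real file-lock guard type; dropping it releases the lock.
struct FileLockGuard;

// Hypothetical stand-in for the backup RPC environment's shared state.
struct SharedState {
    group_guard: Option<FileLockGuard>,
    prev_snap_guard: Option<FileLockGuard>,
    snap_guard: Option<FileLockGuard>,
}

struct BackupEnv {
    state: Arc<Mutex<SharedState>>,
}

impl BackupEnv {
    // On successful finish, release the locks before returning to the
    // client, so an immediate follow-up backup can acquire them.
    // On error, the guards simply stay in the shared state and are only
    // dropped once cleanup completed and the state itself is destroyed.
    fn finish(&self) {
        let mut state = self.state.lock().unwrap();
        state.group_guard = None; // dropping the guard releases the lock
        state.prev_snap_guard = None;
        state.snap_guard = None;
    }
}

fn main() {
    let env = BackupEnv {
        state: Arc::new(Mutex::new(SharedState {
            group_guard: Some(FileLockGuard),
            prev_snap_guard: Some(FileLockGuard),
            snap_guard: Some(FileLockGuard),
        })),
    };
    env.finish();
    assert!(env.state.lock().unwrap().group_guard.is_none());
}
```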
>>
>> Immediate verification of new snapshots already downgraded the lock
>> by dropping the exclusive lock and acquiring a shared lock. Since the
>> dropping is now handled by the finish call, only acquiring the shared
>> lock is required. While this leaves a larger time window for
>> concurrent prunes, the underlying race between verification and
>> prune remains in place either way.
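The non-atomic downgrade can be illustrated with std's `RwLock` (this is an analogy, not the actual on-disk file locking used by proxmox-backup): there is no atomic write-to-read downgrade, so the write guard is dropped and a read guard re-acquired, and a concurrent writer can slip into the gap.

```rust
use std::sync::RwLock;

// Analogy for the exclusive -> shared downgrade: the window between
// dropping the write guard and taking the read guard is where a
// concurrent prune (a writer) could slip in.
fn downgrade_example() -> &'static str {
    let lock = RwLock::new("snapshot");
    let write_guard = lock.write().unwrap(); // exclusive, as during backup
    drop(write_guard);                       // finish() now drops this
    // <- a concurrent prune could acquire the write lock right here
    let read_guard = lock.read().unwrap();   // shared, for verification
    *read_guard
}

fn main() {
    assert_eq!(downgrade_example(), "snapshot");
}
```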
>>
>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=6566
>> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
>
> Reviewed-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
>
> it should be very rare for removal of a snapshot to be attempted right
> after it was created, so the slightly increased race window there
> should be okay.
>
> did you test this with vzdump's `protected` option and verify after
> completion?
I did not manage to trigger the race between backup finish and
verify-after-finish. But there is the (pre-existing) race between
pruning and setting the protected marker, which can be triggered if
timed right; I just reproduced it with the latest rebased version. The
backup task then fails with, e.g.:
```
...
INFO: stopping kvm after backup task
INFO: adding notes to backup
WARN: unable to add notes - proxmox-backup-client failed: Error: unable
to update manifest blob - unable to load blob
'"/devel/datastore/vm/100/2025-09-30T09:09:40Z/index.json.blob"' - No
such file or directory (os error 2)
INFO: marking backup as protected
ERROR: Backup of VM 100 failed - unable to set protected flag - 400 Bad
Request
INFO: Failed at 2025-09-30 11:09:52
...
```
To fix that, we would need to extend the backup upgrade API endpoint
with a protected marker flag, so it can be set a priori and written out
when the manifest is persisted to disk.
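A minimal sketch of that proposal, assuming a flag passed at session upgrade time (all names here are hypothetical, not the real proxmox-backup API): the flag travels with the session parameters and is persisted together with the manifest, so no separate post-finish call can race a prune.

```rust
// Hypothetical parameters for the backup "upgrade" endpoint, extended
// with the proposed `protected` flag. Illustrative names only.
struct UpgradeParams {
    backup_id: String,
    backup_time: i64,
    protected: bool, // proposed new flag, set a priori by the client
}

struct BackupSession {
    params: UpgradeParams,
}

impl BackupSession {
    // When the manifest is written at finish time, the protected state
    // is persisted along with it instead of via a later, racy API call.
    fn finish_manifest(&self) -> bool {
        self.params.protected
    }
}

fn main() {
    let session = BackupSession {
        params: UpgradeParams {
            backup_id: "vm/100".into(),
            backup_time: 0,
            protected: true,
        },
    };
    assert!(session.finish_manifest());
}
```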