[pbs-devel] [PATCH proxmox-backup v2 0/2] fix #6750: fix possible deadlock for s3 backed datastore backups
Christian Ebner
c.ebner at proxmox.com
Fri Sep 26 12:35:43 CEST 2025
On 9/26/25 12:26 PM, Fabian Grünbichler wrote:
> On September 26, 2025 10:42 am, Christian Ebner wrote:
>> These patches aim to fix a deadlock which can occur during backup
>> jobs to datastores backed by an S3 backend. The deadlock is most likely
>> caused by the mutex guard for the backup shared state being held
>> while entering the tokio::task::block_in_place context and executing
>> async code, which can lead to deadlocks as described in [0].
>>
>> Therefore, these patches avoid holding the mutex guard for the shared
>> backup state while performing the s3 backend operations, by
>> dropping it early. To avoid inconsistencies, introduce flags
>> to keep track of the index writers' closing state and add a transient
>> `Finishing` state to be entered during manifest updates.
>>
>> Changes since version 1 (thanks @Fabian):
>> - Use the shared backup state's writers together with a closed flag
>> instead of counting active backend operations.
>> - Replace finished flag with BackupState enum to introduce the new,
>> transient `Finishing` state to be entered during manifest updates.
>> - Add missing checks and refactor code to the now mutable reference when
>> accessing the shared backup state in the respective close calls.
>
> this looks a lot better!
>
> but I think we both missed one more problematic code path:
>
> - env.remove_backup() (sync)
> -- locks state
> -- calls pbs_datastore::datastore::remove_backup() (sync)
> --- calls pbs_datastore::backup_info::BackupDir::destroy (sync)
> ---- calls proxmox_async_runtime::block_on(s3_client.delete_objects_by_prefix)

Good catch!

> this one is only called in mod.rs *after* the backup session processing
> is completed, I am not even sure why we call into the env there (all we
> do with it is set the state to finished, but that has no effect at that
> point anymore AFAICT?)

Must double check, but that might be related to allowing the client
connection to disappear without further error?

> maybe we should just move the remove_backup fn from the env to mod.rs
> and drop the state update from it?

Okay, will check what the further implications of that are, thanks!
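
For reference, a minimal sketch of the pattern discussed in this thread: take the lock only to check and update the state, release it before the blocking backend call, and re-take it afterwards. This is not the actual proxmox-backup code. `SharedState`, `slow_backend_operation` and `finish_backup` are hypothetical names, the backend call is a plain placeholder function rather than a real S3 client call, and reverting to `Active` on error is an assumption made only for illustration.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the session state, not the actual PBS type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BackupState {
    Active,
    /// Transient state, entered while the backend operation is in flight.
    Finishing,
    Finished,
}

/// Hypothetical shared state protected by a mutex.
struct SharedState {
    state: BackupState,
}

/// Placeholder for a blocking backend call (assumption: in real code this
/// would be something that blocks on an async S3 request).
fn slow_backend_operation() -> Result<(), String> {
    Ok(())
}

fn finish_backup(shared: &Arc<Mutex<SharedState>>) -> Result<(), String> {
    {
        // Hold the guard only long enough to check and update the state.
        let mut guard = shared.lock().unwrap();
        if guard.state != BackupState::Active {
            return Err(format!("unexpected state {:?}", guard.state));
        }
        guard.state = BackupState::Finishing;
    } // guard is dropped here, before any blocking work starts

    // No mutex guard is held while the backend operation runs.
    let result = slow_backend_operation();

    // Re-acquire the lock to leave the transient state again.
    let mut guard = shared.lock().unwrap();
    guard.state = match result {
        Ok(()) => BackupState::Finished,
        // Assumption for this sketch: fall back to Active on failure.
        Err(_) => BackupState::Active,
    };
    result
}
```

Since the guard is released in between, other callers can observe the state while the backend operation runs. The transient `Finishing` value lets them detect that and bail out instead of acting on a half-finished session.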