[pbs-devel] [PATCH proxmox-backup v2 0/2] fix #6750: fix possible deadlock for s3 backed datastore backups

Fabian Grünbichler f.gruenbichler at proxmox.com
Fri Sep 26 12:45:28 CEST 2025


On September 26, 2025 12:35 pm, Christian Ebner wrote:
> On 9/26/25 12:26 PM, Fabian Grünbichler wrote:
>> On September 26, 2025 10:42 am, Christian Ebner wrote:
>>> These patches aim to fix a deadlock which can occur during backup
>>> jobs to datastores backed by an S3 backend. The deadlock is most
>>> likely caused by the mutex guard for the backup shared state being
>>> held while entering the tokio::task::block_in_place context and
>>> executing async code, which can lead to deadlocks as described in [0].
>>>
>>> Therefore, these patches avoid holding the mutex guard for the shared
>>> backup state while performing the s3 backend operations, by
>>> dropping it early. To avoid inconsistencies, introduce flags
>>> to keep track of the index writers' closing state and add a transient
>>> `Finishing` state to be entered during manifest updates.
>>>
>>> Changes since version 1 (thanks @Fabian):
>>> - Use the shared backup state's writers in combination with a closed
>>>    flag instead of counting active backend operations.
>>> - Replace finished flag with BackupState enum to introduce the new,
>>>    transient `Finishing` state to be entered during manifest updates.
>>> - Add missing checks and refactor code to use the now-mutable reference
>>>    when accessing the shared backup state in the respective close calls.
>> 
>> this looks a lot better!
>> 
>> but I think we both missed one more problematic code path:
>> 
>> - env.remove_backup() (sync)
>> -- locks state
>> -- calls pbs_datastore::datastore::remove_backup() (sync)
>> --- calls pbs_datastore::backup_info::BackupDir::destroy (sync)
>> ---- calls proxmox_async_runtime::block_on(s3_client.delete_objects_by_prefix)
> 
> Good catch!
> 
>> this one is only called in mod.rs *after* the backup session processing
>> is completed, I am not even sure why we call into the env there (all we
>> do with it is set the state to finished, but that has no effect at that
>> point anymore AFAICT?)
> 
> Must double check, but that might be related to allowing the client 
> connection to disappear without further error?

I don't think so, that (ugly hack) happens as part of processing
requests, while the removal happens afterwards *based on the result* of
that processing.

>> maybe we should just move the remove_backup fn from the env to mod.rs
>> and drop the state update from it?
> 
> Okay, will check what the further implications of that are, thanks!
> 
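
For reference, the pattern discussed in this thread (take the lock only
to flip the state, drop the guard, run the backend call, then re-lock)
could be sketched roughly as below. This is only an illustration using
std::sync::Mutex; `Shared`, `finish` and the `backend_op` closure are
hypothetical names and not the actual proxmox-backup types, and the
closure merely stands in for a blocking call into the s3 client.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the backup state enum mentioned in the cover letter.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackupState {
    Active,
    Finishing, // transient: a backend operation is in flight
    Finished,
}

// Hypothetical shared state; the real one holds far more (writers, flags, ...).
struct Shared {
    state: BackupState,
    prefix: String,
}

// Runs `backend_op` (assumed to block, e.g. on an s3 request) without holding the lock.
fn finish<F>(shared: &Arc<Mutex<Shared>>, backend_op: F) -> Result<(), String>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    // Lock only long enough to check and flip the state and copy out the input.
    let prefix = {
        let mut guard = shared.lock().unwrap();
        if guard.state != BackupState::Active {
            return Err(format!("unexpected state {:?}", guard.state));
        }
        guard.state = BackupState::Finishing;
        guard.prefix.clone()
    }; // guard is dropped here, before the backend call

    let result = backend_op(&prefix);

    // Re-acquire the lock to record the outcome.
    let mut guard = shared.lock().unwrap();
    guard.state = if result.is_ok() {
        BackupState::Finished
    } else {
        BackupState::Active
    };
    result
}
```

The transient `Finishing` value is what lets concurrent callers notice
that an operation is in flight while the lock itself is free.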