[pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase

Christian Ebner c.ebner at proxmox.com
Tue Apr 15 14:50:46 CEST 2025


On 4/15/25 13:42, Fabian Grünbichler wrote:
> On April 15, 2025 12:54 pm, Christian Ebner wrote:
>> During phase 1 of garbage collection referenced chunks are marked as
>> in use by iterating over all index files and updating the atime on
>> the chunks referenced by these.
>>
>> In an edge case for long running garbage collection jobs, a newly
>> added snapshot (created after the start of GC) might reuse known
>> chunks from a previous snapshot, while the index of that previous
>> snapshot disappears before the marking phase can reach it (e.g.
>> pruned because the retention setting keeps only 1 snapshot). Chunks
>> referenced only by that previous index file might then not get
>> marked, since none of the other index files references them.
>>
>> Since commit 74361da8 ("garbage collection: generate index file list
>> via datastore iterators") this is even less likely, as the iteration
>> now also picks up index files added during phase 1, so either the
>> new or the previous index file will account for these chunks (the
>> previous backup snapshot can only be pruned after the new one has
>> finished, since it is locked). There remains, however, a small race
>> window between reading the snapshots in the backup group and reading
>> the actual index files for marking.
>>
>> Fix this race by:
>> 1. checking whether the last snapshot of a group disappeared and, if so,
>> 2. generating the snapshot list again, looking for new index files
>>     previously not accounted for, and
>> 3. exclusively locking the group if the snapshot list still changed
>>     after the 10th retry, to avoid possible endless looping (which will
>>     cause concurrent operations on this group to fail).
>>
>> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
>> ---
>> changes since version 1:
>> - rebased onto current master
>>
>>   pbs-datastore/src/datastore.rs | 110 ++++++++++++++++++++++-----------
>>   1 file changed, 74 insertions(+), 36 deletions(-)
>>
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index aa38e2ac1..788dbf212 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -1143,47 +1143,85 @@ impl DataStore {
>>               let namespace = namespace.context("iterating namespaces failed")?;
>>               for group in arc_self.iter_backup_groups(namespace)? {
>>                   let group = group.context("iterating backup groups failed")?;
> 
> pre-existing/unrelated - what if the group got pruned at the right
> moment here? should we maybe recheck that it exists and ignore the error
> in that case?

Hmm, that's right... The same probably applies to the namespace iteration 
above as well...
But I'm not sure we should check whether the group exists and ignore the 
error? What's your reasoning here? As far as I can see, the whole state of 
the iterator boils down to the readdir_r() calls in nix::dir::Iter::next(), 
but I'm not sure what state one is looking at at that moment.

> 
>> -                let mut snapshots = group.list_backups().context("listing snapshots failed")?;
>> -                // Sort by snapshot timestamp to iterate over consecutive snapshots for each image.
>> -                BackupInfo::sort_list(&mut snapshots, true);
>> -                for snapshot in snapshots {
>> -                    for file in snapshot.files {
>> -                        worker.check_abort()?;
>> -                        worker.fail_on_shutdown()?;
>> -
>> -                        let mut path = snapshot.backup_dir.full_path();
>> -                        path.push(file);
>> -
>> -                        let index = match self.open_index_reader(&path)? {
>> -                            Some(index) => index,
>> -                            None => {
>> -                                unprocessed_index_list.remove(&path);
>> +
>> +                // Avoid race between listing/marking of snapshots by GC and pruning the last
>> +                // snapshot in the group, following a new snapshot creation. Otherwise known chunks
>> +                // might only be referenced by the new snapshot, so it must be read as well.
>> +                let mut retry_counter = 0;
>> +                let mut processed_group_indices = HashSet::new();
>> +                let mut needs_retry;
> 
> I don't think we need this variable (see below)

Acked

> 
>> +                loop {
>> +                    needs_retry = false;
>> +                    let _lock = if retry_counter == 10 {
>> +                        // Last resort, exclusively lock the backup group to make sure no new
>> +                        // backups can be written. This will only trigger if none of the previous
>> +                        // retries converged successfully, which is very unlikely.
>> +                        Some(group.lock())
> 
> this should check the result? this would also fail if a backup is
> currently going on (very likely if we end up here?) and abort the GC
> then, but we don't have a way to lock a group with a timeout at the
> moment.. but maybe we can wait and see if users actually run into that,
> we can always extend the locking interface then..

True, but since this is very unlikely to happen, I would opt to fail and 
add an error context here so this can easily be traced back to this code 
path.
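
Roughly like this (just a sketch, the exact context message is made up and
it assumes group.lock() returns an anyhow::Result so Context can be used):

    let _lock = if retry_counter == 10 {
        // last resort: take the exclusive group lock, but surface a failure
        // with enough context to trace it back to the GC retry path
        Some(group.lock().context(
            "failed to exclusively lock backup group for GC phase 1 retry",
        )?)
    } else {
        None
    };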

> 
> we should probably also abort if we ever hit a retry_counter > 10, since
> that would mean we missed some bug and this could loop forever..

Ah yes, good point! Will add a bail for that case.
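
Something along these lines (sketch, wording invented):

    if retry_counter > 10 {
        // should be unreachable: the exclusive lock taken on the 10th retry
        // must prevent further changes to the snapshot list, so treat this
        // as a bug instead of looping forever
        bail!("snapshot list of backup group changed even after 10 retries");
    }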

> 
>> +                    } else {
>> +                        None
>> +                    };
>> +
>> +                    let mut snapshots = group.list_backups().context("listing snapshots failed")?;
> 
> same question as above applies here as well - what if the whole group vanished?

Acked, must check
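
Could look something like this (illustrative only; using full_group_path()
as the existence check is an assumption on my side):

    let mut snapshots = match group.list_backups() {
        Ok(snapshots) => snapshots,
        Err(err) => {
            if !group.full_group_path().exists() {
                // group vanished (e.g. fully pruned) while GC was running,
                // so there is nothing left to mark for it
                break;
            }
            return Err(err.context("listing snapshots failed"));
        }
    };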

> 
>> +                    let snapshot_count = snapshots.len();
>> +                    // Sort by snapshot timestamp to iterate over consecutive snapshots for each snapshot.
> 
> this comment makes no sense ;)
> 
> // iterate over sorted snapshots to benefit from chunk caching
> 
> or something like this? if we want to keep the comment at all..

Yeah, I incorrectly adapted the pre-existing one here, but it does not 
give much insight, so let's just drop it.

> 
>> +                    BackupInfo::sort_list(&mut snapshots, true);
>> +                    for (count, snapshot) in snapshots.into_iter().enumerate() {
>> +                        for file in snapshot.files {
> 
> snapshot.files also includes blobs, should we filter them here to avoid
> problems if we adapt below to handle partial/ongoing pruning?

Yes, it might be better to exclude these at this point already.
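
For example by only keeping the index files (sketch; matching on the file
extension is just one possible way to filter out blobs):

    // blobs (manifest, client log, ...) never reference chunks, so only
    // fixed and dynamic index files are relevant for the marking phase
    let index_files = snapshot
        .files
        .iter()
        .filter(|name| name.ends_with(".fidx") || name.ends_with(".didx"));
    for file in index_files {
        // ... existing per-file marking logic ...
    }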

> 
>> +                            worker.check_abort()?;
>> +                            worker.fail_on_shutdown()?;
>> +
>> +                            let mut path = snapshot.backup_dir.full_path();
>> +                            path.push(file);
>> +
>> +                            // Avoid reprocessing of already seen index files on retry
>> +                            if retry_counter > 0 && processed_group_indices.contains(&path) {
>>                                   continue;
>>                               }
>> -                        };
>> -                        self.index_mark_used_chunks(
>> -                            index,
>> -                            &path,
>> -                            &mut chunk_lru_cache,
>> -                            status,
>> -                            worker,
>> -                        )?;
>> -
>> -                        if !unprocessed_index_list.remove(&path) {
>> -                            info!("Encountered new index file '{path:?}', increment total index file count");
>> -                            index_count += 1;
>> -                        }
>>   
>> -                        let percentage = (processed_index_files + 1) * 100 / index_count;
>> -                        if percentage > last_percentage {
>> -                            info!(
>> -                                "marked {percentage}% ({} of {index_count} index files)",
>> -                                processed_index_files + 1,
>> -                            );
>> -                            last_percentage = percentage;
>> +                            let index = match self.open_index_reader(&path)? {
>> +                                Some(index) => index,
>> +                                None => {
>> +                                    if count + 1 == snapshot_count
>> +                                        && !snapshot.backup_dir.full_path().exists()
> 
> what if the snapshot dir still exists, but the index doesn't? wouldn't
> that also be problematic? I think we should probably start the loop over
> as soon as something unexpected is going on with the last snapshot of
> the group..

Agreed, pruning should remove the directory as a whole, but let's be 
conservative here and cover inconsistent states as well (see the combined 
sketch below).

> 
>> +                                    {
>> +                                        needs_retry = true;
> 
> since this is the only time we set it to true, and at this point we
> could just start another iteration of the outer loop via labeled
> continue?

Ah, yes that's indeed better, thx!
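
Putting these pieces together (and assuming blobs were already filtered out
as above), the None arm could then look roughly like this (sketch, the
'retry label on the outer loop is invented here):

    None => {
        unprocessed_index_list.remove(&path);
        if count + 1 == snapshot_count {
            // the last snapshot of the group is gone or has no readable
            // index anymore: rescan the group so a snapshot created in the
            // meantime is picked up as well
            retry_counter += 1;
            continue 'retry;
        }
        continue;
    }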

> 
>> +                                    }
>> +                                    unprocessed_index_list.remove(&path);
> 
> this could move above the if then

Acked

> 
>> +                                    continue;
>> +                                }
>> +                            };
>> +                            self.index_mark_used_chunks(
>> +                                index,
>> +                                &path,
>> +                                &mut chunk_lru_cache,
>> +                                status,
>> +                                worker,
>> +                            )?;
>> +                            processed_group_indices.insert(path.clone());
>> +
>> +                            if !unprocessed_index_list.remove(&path) {
>> +                                info!("Encountered new index file '{path:?}', increment total index file count");
>> +                                index_count += 1;
>> +                            }
>> +
>> +                            let percentage = (processed_index_files + 1) * 100 / index_count;
>> +                            if percentage > last_percentage {
>> +                                info!(
>> +                                    "marked {percentage}% ({} of {index_count} index files)",
>> +                                    processed_index_files + 1,
>> +                                );
>> +                                last_percentage = percentage;
>> +                            }
>> +                            processed_index_files += 1;
>>                           }
>> -                        processed_index_files += 1;
>>                       }
>> +
>> +                    if needs_retry {
>> +                        retry_counter += 1;
>> +                        continue;
>> +                    }
> 
> this whole thing here can go away then..

Acked

> 
>> +
>> +                    break;
>>                   }
>>               }
>>           }
>> -- 
>> 2.39.5
>>
>>
>>