[pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
Fabian Grünbichler
f.gruenbichler at proxmox.com
Wed Apr 16 09:11:06 CEST 2025
On April 16, 2025 8:31 am, Christian Ebner wrote:
> Hi Thomas,
>
> On 4/15/25 17:40, Thomas Lamprecht wrote:
>> On 15/04/2025 15:14, Fabian Grünbichler wrote:
>>>>> this should check the result? this would also fail if a backup is
>>>>> currently going on (very likely if we end up here?) and abort the GC
>>>>> then, but we don't have a way to lock a group with a timeout at the
>>>>> moment.. but maybe we can wait and see if users actually run into that,
>>>>> we can always extend the locking interface then..
>>>> True, but since this is very unlikely to happen, I would opt to fail and
>>>> add an error context here so this can easily be traced back to this code
>>>> path.
>>> yes, for now I'd say aborting GC with a clear error here is best. we
>>> cannot safely continue..
>>
>> Did not check v3, but note that users often do not run GC with a high
>> frequency, due to the load it generates and the time it needs, but they
>> still depend on it finishing so that space is freed.
>>
>> So if there is any approach we can take or add to avoid aborting
>> completely, it would IMO be quite worth evaluating more closely.
this should only trigger for setups that do things like very frequent
incremental backups with --keep-last 1 and pruning immediately after
backing up. I think even without Christian's most recently proposed
improvement, hitting this 10 times in a row would practically mean GC is
impossible in such a setup anyway.
the race window is the following after all:
- list snapshots in the group
- sort them
- iterate over them, skipping previously marked ones
- mark each not-yet-seen one
- a new snapshot based on the previously last one was made since the
listing, and that previously last one was pruned before the iteration
could mark it
so we'd have to repeatedly hit this sequence:
- list the group
- a backup finishes + a prune finishes in this group
- the iteration reaches the last snapshot in the list
where the whole sequence is repeated in a tight loop, and the delta
between iterations should be a single snapshot, so the iteration should
reach it almost instantly.
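
to make that concrete, here's a minimal sketch of how per-group marking
with re-listing and a bounded retry count could look. this is not the
actual patch code - `list_snapshots`, `mark_snapshot` and the
string-typed group/snapshot parameters are made-up stand-ins for the
datastore API, purely to illustrate the control flow:

    use std::collections::BTreeSet;

    use anyhow::{bail, Error};

    const MAX_RETRIES: usize = 10;

    // hypothetical datastore helpers, only here so the sketch compiles
    fn list_snapshots(_group: &str) -> Result<Vec<String>, Error> {
        unimplemented!()
    }
    fn mark_snapshot(_group: &str, _snapshot: &str) -> Result<(), Error> {
        unimplemented!()
    }

    fn mark_group(group: &str) -> Result<(), Error> {
        let mut marked = BTreeSet::new();
        for _ in 0..MAX_RETRIES {
            // re-list on every pass: snapshots created after the previous
            // listing (e.g. based on a since-pruned predecessor) show up here
            let mut snapshots = list_snapshots(group)?;
            snapshots.sort();

            let mut saw_new = false;
            for snapshot in snapshots {
                if marked.contains(&snapshot) {
                    continue; // already marked in an earlier pass
                }
                // touch all chunks referenced by this snapshot's indices
                mark_snapshot(group, &snapshot)?;
                marked.insert(snapshot);
                saw_new = true;
            }

            // a pass that found nothing new means the listing was stable,
            // i.e. every snapshot currently in the group has been marked
            if !saw_new {
                return Ok(());
            }
        }
        bail!("failed to mark group '{group}' after {MAX_RETRIES} retries");
    }

a quiescent group costs exactly two listings here (one that marks, one
that confirms nothing new appeared), and only a group that keeps growing
between listings ever exhausts the retries and aborts with a clear
error, as discussed above.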
the situation with any variant of this patch is very different from what
we had before, which was:
- list all indices in the datastore
- iterate and mark
- if the last snapshot of any group was used as base for a new backup,
and pruned before the iteration reached that group+snapshot, chunks
could be lost
which, for setups where GC took days, made the issue very much possible
to hit.
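
for contrast, the old approach in the same made-up API (again just a
sketch, `list_all_indices` and `mark_index` are hypothetical):

    use anyhow::Error;

    // again hypothetical, only here so the sketch compiles
    fn list_all_indices(_datastore: &str) -> Result<Vec<String>, Error> {
        unimplemented!()
    }
    fn mark_index(_index: &str) -> Result<(), Error> {
        unimplemented!()
    }

    fn mark_all_pre_patch(datastore: &str) -> Result<(), Error> {
        // a single listing up front: the race window spans the whole
        // (potentially days-long) iteration below
        let indices = list_all_indices(datastore)?;
        for index in indices {
            // a snapshot pruned after this listing is still in `indices`
            // (marking it just fails or is skipped), while a new backup
            // based on it is missing entirely - so their shared chunks
            // are never marked
            mark_index(&index)?;
        }
        Ok(())
    }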
>> FWIW, a completely different alternative might be to change not GC
>> but pruning when a GC job runs, e.g. (spitballing/hand waving) move
>> the index to a trash folder and notify GC about that.
>
> yes, having some sort of shadow copy of the index files came to mind as
> well. I did however disregard that for the GC itself, because it would
> be expensive and probably run into similar races with pruning.
>
> Your suggested approach would however eliminate that, and would also be
> a nice feature! GC could then clean up all the trashed index files with
> some retention logic in a new phase 3, after cleaning up the chunks.
>
> E.g. it has already happened to some users that they accidentally
> pruned a snapshot they still needed. So might it make sense to add a
> trash can as a feature?
it has one downside - it's no longer possible to prune to get out of
(almost) full datastore situations, unless we also have a "skip trash
can" feature? but yes, it might be nice for non-GC-safety reasons as
well.
for GC, I think the order should be:
- clear out trash can (this doesn't race with marking, so no issues)
- mark (including anything that got added to the trash since clearing it
out, to prevent the prune+GC race)
- sweep (like now)
else the trash can feature would effectively double the time until
garbage is actually removed, or double the run time of GC because we
have to run it twice back-to-back ;)
if we make the trash can feature unconditional, then once it is
implemented we can drop the retry logic when marking a group, as it's no
longer needed.
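
hand-waving that order in code (all three phase helpers are made up for
illustration, not an existing API):

    use anyhow::Error;

    // made-up phase helpers, only here so the sketch compiles
    fn empty_trash(_datastore: &str) -> Result<(), Error> {
        unimplemented!()
    }
    fn mark_used_chunks(_datastore: &str) -> Result<(), Error> {
        unimplemented!()
    }
    fn sweep_unused_chunks(_datastore: &str) -> Result<(), Error> {
        unimplemented!()
    }

    fn run_gc(datastore: &str) -> Result<(), Error> {
        // clear out the trash can first - this doesn't race with marking,
        // and anything trashed from here on is still covered below
        empty_trash(datastore)?;

        // mark, including index files that land in the trash after the
        // clearing above (prevents the prune+GC race)
        mark_used_chunks(datastore)?;

        // sweep unmarked chunks, like now
        sweep_unused_chunks(datastore)?;

        Ok(())
    }

that way garbage trashed before the current GC run is removed in the
same run, instead of only in the next one.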
> Nevertheless, I do think that the order of snapshot iteration for the
> GC run should be reversed, as that even further reduces the window of
> opportunity where things can go wrong (as stated in my self-reply to
> version 3 of the patch).
I think with this change the chances of hitting the retry counter limit
in practice should already be zero..
because if listing is slow, then doing a backup should be slow as well
(and thus the race becomes very unlikely).
but if any user reports aborted GCs because of this (or we are worried
about it), we can simply bump the counter from 10 to 100 or 1000; it
shouldn't affect regular setups in any fashion after all?