[pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
Christian Ebner
c.ebner at proxmox.com
Wed Apr 16 08:31:01 CEST 2025
Hi Thomas,
On 4/15/25 17:40, Thomas Lamprecht wrote:
> On 15/04/2025 15:14, Fabian Grünbichler wrote:
>>>> this should check the result? this would also fail if a backup is
>>>> currently going on (very likely if we end up here?) and abort the GC
>>>> then, but we don't have a way to lock a group with a timeout at the
>>>> moment.. but maybe we can wait and see if users actually run into that,
>>>> we can always extend the locking interface then..
>>> True, but since this is very unlikely to happen, I would opt to fail and
>>> add an error context here so this can easily be traced back to this code
>>> path.
>> yes, for now I'd say aborting GC with a clear error here is best. we
>> cannot safely continue..
>
> Did not check v3, but note that users often do not run GC with a high
> frequency due to the load it generates and time it needs, but still
> depend on it to finish so space is being freed.
>
> So if there is any way we can go or add to avoid aborting completely,
> it would be IMO quite worth to evaluate doing that more closely.
>
> FWIW, a completely different alternative might be to not change
> GC but pruning when a GC job runs, e.g. (spitballing/hand waving)
> move the index to a trash folder and notify GC about that.
yes, having some sort of shadow copy of the index files came to mind as
well. I did however disregard that for the GC itself, because it would
be expensive and probably run into similar races with pruning.
Your suggested approach would however eliminate that, and would also
be a nice feature! GC could then clean up all the trashed index files
with some retention logic in a new phase 3, after cleaning up the chunks.
E.g. it already happened to some users that they pruned a snapshot they
still needed by accident. So might it make sense to add a trash can as a
feature?
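To make the idea a bit more concrete, here is a rough sketch using only
the Rust standard library. Everything in it is hypothetical and not
existing proxmox-backup code: the names `move_to_trash` and `purge_trash`,
the flat trash directory, and the `<epoch-secs>.<name>` naming used to
remember when a file was trashed. A real implementation would have to
preserve the namespace/group/snapshot path, take the proper locks, and
let GC mark the chunks referenced by trashed index files.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical prune-side helper: instead of unlinking the index file,
/// rename it into `trash_dir`, prefixing the name with the trash time.
fn move_to_trash(index: &Path, trash_dir: &Path, now: SystemTime) -> io::Result<PathBuf> {
    let name = index
        .file_name()
        .and_then(|n| n.to_str())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "invalid index file name"))?;
    let secs = now
        .duration_since(UNIX_EPOCH)
        .map_err(|err| io::Error::new(io::ErrorKind::InvalidInput, err))?
        .as_secs();
    fs::create_dir_all(trash_dir)?;
    let target = trash_dir.join(format!("{secs}.{name}"));
    // rename is atomic only if source and target are on the same filesystem
    fs::rename(index, &target)?;
    Ok(target)
}

/// Hypothetical GC "phase 3": remove trashed index files whose trash time
/// is at least `retention` in the past. Returns the number of removed files.
fn purge_trash(trash_dir: &Path, retention: Duration, now: SystemTime) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(trash_dir)? {
        let path = entry?.path();
        // Parse the `<epoch-secs>.` prefix written by move_to_trash.
        let trashed_at = path
            .file_name()
            .and_then(|n| n.to_str())
            .and_then(|n| n.split_once('.'))
            .and_then(|(secs, _)| secs.parse::<u64>().ok())
            .map(|secs| UNIX_EPOCH + Duration::from_secs(secs));
        // Skip anything that does not follow the naming scheme.
        let Some(trashed_at) = trashed_at else { continue };
        let age = now.duration_since(trashed_at).unwrap_or(Duration::ZERO);
        if age >= retention {
            fs::remove_file(&path)?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

The trash time is encoded in the file name because a rename keeps the
original mtime of the index, so the mtime alone would not tell us when
the snapshot was pruned.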
Nevertheless, I do think that the order of snapshot iteration for the
GC run should be reversed, as that even further reduces the window of
opportunity where things can go wrong (as stated in my self-reply to
version 3 of the patch).