[pbs-devel] [PATCH proxmox 2/2] pbs api types: add option to set GC chunk cleanup wait period

Fabian Grünbichler f.gruenbichler at proxmox.com
Mon Mar 3 16:30:53 CET 2025


On March 3, 2025 2:30 pm, Christian Ebner wrote:
> On 3/3/25 13:58, Fabian Grünbichler wrote:
>> On February 19, 2025 5:48 pm, Christian Ebner wrote:
>>> Add the `gc-wait-period` option to the datastore tuning parameters.
>>> This allows specifying the time after which chunks are no longer
>>> considered in use if their atime has not been updated since then.
>>> This option is only taken into account if the `gc-atime-check` is
>>> enabled, to avoid potential data loss.
>>>
>>> The default is to keep chunks within the 24h 5m timespan (given no
>>> active writers). The 5m minimum was chosen to be in line with the
>>> safety offset already used for garbage collection.
>>>
>>> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
>>> ---
>>> changes since version 1:
>>> - not present in previous version
>>>
>>>   pbs-api-types/src/datastore.rs | 13 +++++++++++++
>>>   1 file changed, 13 insertions(+)
>>>
>>> diff --git a/pbs-api-types/src/datastore.rs b/pbs-api-types/src/datastore.rs
>>> index a7a9db7d..7542e969 100644
>>> --- a/pbs-api-types/src/datastore.rs
>>> +++ b/pbs-api-types/src/datastore.rs
>>> @@ -223,6 +223,13 @@ pub enum DatastoreFSyncLevel {
>>>       Filesystem,
>>>   }
>>>   
>>> +pub const GC_WAIT_PERIOD_SCHEMA: Schema =
>>> +    IntegerSchema::new("Wait period (in minutes) for garbage collection phase 2 chunk cleanup (default 24h 5m)")
>>> +        .minimum(5)
>>> +        .maximum(1445)
>> 
>> these seem a bit conservative - if we introduce the option, we could
>> also allow reducing the grace period to 0 or requiring longer grace
>> periods?
> 
> I did stick to the additional 5 minute safety gap, as this was also
> introduced with the oldest writer change in commit 11861a48
> ("src/backup/chunk_store.rs: fix GC"). Since that commit does not
> explicitly mention the reason (I assume it is to prevent a race between
> writer instantiation and garbage collection), I stuck with this
> additional lower limit.

if there is a race, we should close it instead of papering over it with
a 5m threshold ;)

a writer must not touch chunks before it has registered (that is a
given, else none of this makes any sense whatsoever, just mentioning it
for completeness' sake).

GC must not touch chunks before it has determined the cutoff (also a
given).

the writer registration happens using a shared lock (it's part of
obtaining the shared lock). the same shared lock is used by GC to
retrieve the oldest registered writer (so no race here).
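
for illustration, the pattern I have in mind looks roughly like this
(just a sketch with made-up names and an in-process mutex standing in
for the shared lock - not the actual ChunkStore API; the timestamp comes
from proxmox_time::epoch_i64() as in the snippet below):

    use std::sync::{Arc, Mutex};

    /// hypothetical registry: a single mutex guards the list of
    /// registered writer timestamps, both for registration and for
    /// GC's lookup of the oldest one
    struct WriterRegistry {
        writers: Mutex<Vec<i64>>,
    }

    /// registration guard - dropping it deregisters the writer
    struct WriterGuard {
        registry: Arc<WriterRegistry>,
        timestamp: i64,
    }

    impl WriterRegistry {
        /// writer side: the timestamp is generated and persisted while
        /// the lock is held - only once the guard exists may the writer
        /// start touching/inserting chunks
        fn register(registry: &Arc<WriterRegistry>) -> WriterGuard {
            let mut writers = registry.writers.lock().unwrap();
            let timestamp = proxmox_time::epoch_i64();
            writers.push(timestamp);
            WriterGuard {
                registry: Arc::clone(registry),
                timestamp,
            }
        }

        /// GC side: reading the oldest registered timestamp happens
        /// under the same lock
        fn oldest_writer(&self) -> Option<i64> {
            self.writers.lock().unwrap().iter().copied().min()
        }
    }

    impl Drop for WriterGuard {
        fn drop(&mut self) {
            // deregistration also happens under the lock
            let mut writers = self.registry.writers.lock().unwrap();
            if let Some(pos) = writers.iter().position(|&t| t == self.timestamp) {
                writers.remove(pos);
            }
        }
    }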

so either:
- at the time GC checks for the oldest writer, no writer is registered
  yet, and the timestamp obtained right before that check minus the
  safety margin is a valid cutoff
- or a writer has registered, in which case GC *must* see it, or it must
  already have completed


            let phase1_start_time = proxmox_time::epoch_i64();
            let oldest_writer = self
                .inner
                .chunk_store
                .oldest_writer()
                .unwrap_or(phase1_start_time);

the writer timestamp value is generated during writer creation while the
mutex is held. the GC timestamp value is generated before the mutex is
locked. I don't see a way for a writer timestamp value smaller than the
GC timestamp to ever be stored without GC seeing it?

we basically have a sequence of
- W obtains mutex
- W generates timestamp and persists it
- W drops mutex (and only after this can it start touching/inserting
  chunks)

and another of
- GC generates timestamp
- GC obtains mutex
- GC looks at list of writers
- GC drops mutex

that might interleave. since the latter doesn't write to the protected
data, we can condense it to a single step of "lock and read", and it's
mostly the ordering of GC generating its own timestamp compared to when
the oldest writer does the same while holding the mutex that is
interesting, AFAICT.

we either have GC generating its timestamp before the writer, but
reading afterwards:

T0: GC generates timestamp value
sometime between T0 and T1: W obtains mutex
T1: W generates timestamp and persists it in shared guard list
T2: W drops mutex
T3: GC obtains mutex and sees W registered with T1

which is okay - the cutoff can be T0 or T1 in this case, it doesn't matter

or the writer generating it before the GC:

sometime before T0: W obtains mutex
T0: W generates timestamp and persists it in shared guard list
T1: W drops mutex
T2: GC generates timestamp value
T3: GC obtains mutex and sees W registered with T0

the cutoff must be T0 in this case

or GC generating it while a writer is registering:

sometime before T0: W obtains mutex
T0: W generates timestamp and persists it in shared guard list
T1: GC generates timestamp value
T2: W drops mutex
T3: GC obtains mutex and sees W registered with T0

T0 will still be seen and is the right cutoff

or another ordering for GC generating while writer registration is
taking place:

sometime before T0: W obtains mutex
T0: GC generates timestamp value
T1: W generates timestamp and persists it in shared guard list
T2: W drops mutex
T3: GC obtains mutex and sees W registered with T1

since T0 < T1, T0 is the right value to use here as well
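
in other words, taking whichever of the two timestamps is older can
never be wrong - as a hypothetical helper, just to make the invariant
from the cases above explicit:

    /// conservative cutoff: with no writer registered, GC's own
    /// timestamp is valid; otherwise the older of the two wins, so none
    /// of the interleavings above can yield a cutoff newer than a
    /// registered writer's timestamp
    fn gc_cutoff(gc_start: i64, oldest_writer: Option<i64>) -> i64 {
        oldest_writer.map_or(gc_start, |w| w.min(gc_start))
    }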

since we only care about the oldest writer, we don't have to worry about
writers racing with each other - there's only a single mutex, so one of
them must be the oldest and the only one we care about. theoretically,
lots of writers being spawned could keep GC waiting for the mutex for
quite a while, even allowing writers to complete their run, drop the
guard and thus deregister. but since GC hasn't started its work yet, the
index files generated by those writers must already be visible, and GC
doesn't have to know about their already vanished timestamps.

the only thing I could think of would be time jumps? but then, 5 minutes
are completely arbitrary and might not save anything anyway.. maybe
Dietmar (CCed) remembers where the extra 5 minutes come from?

> Increasing the upper limit might be possible without such concerns,
> however, I do not really see the use case for that.
> 
>> 
>> wait period is also a bit of a misnomer IMHO, this is not something
>> where we "wait" per se, but rather something to account for the
>> timestamp information potentially not being accurate.. not sure what a
>> better term would be, "grace period" doesn't really fit either..
> 
> I stuck with Thomas' naming suggestion here, but maybe something like
> `gc-cleanup-safety-margin` is more fitting? Although it does not imply
> that this is due to the atime... Other suggestions?

to me, wait period evokes that we *want* to wait. we ideally don't want
to, except if we have to. but like I said, naming *is* hard and I also
don't want to spend too much time on bike shedding ;)

>> it's a kind of "safety margin", maybe something related to that? naming
>> can be quite hard..
>> 
>> maybe gc_atime_safety_margin ? that also implies that setting a
>> non-default value has potential safety implications/risks your data,
>> IMHO more than "reducing a wait period" does..
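
concretely, something along these lines is what I had in mind (just a
sketch - name, description, bounds and default value are all up for
discussion):

    pub const GC_ATIME_SAFETY_MARGIN_SCHEMA: Schema = IntegerSchema::new(
        "Safety margin (in minutes) for GC phase 2 chunk cleanup, accounting for chunk atimes potentially not being accurate (default 24h 5m)",
    )
    .minimum(0) // allow disabling the margin entirely
    .maximum(1440 * 7) // arbitrary example upper bound of one week
    .default(1445) // keep the current 24h 5m behaviour as default
    .schema();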



