[pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
Wolfgang Bumiller
w.bumiller at proxmox.com
Thu Feb 13 16:34:29 CET 2025
On Thu, Feb 13, 2025 at 04:21:33PM +0100, Lukas Wagner wrote:
>
>
> On 2025-02-13 15:19, Wolfgang Bumiller wrote:
> > On Thu, Feb 13, 2025 at 02:50:32PM +0100, Lukas Wagner wrote:
> >>
> >>
> >> On 2025-02-13 09:55, Wolfgang Bumiller wrote:
> >>>> loop {
> >>>> let old_settings = self.settings.clone();
> >>>> tokio::select! {
> >>>> @@ -124,7 +132,12 @@ impl MetricCollectionTask {
> >>>> "metric collection interval changed to {} seconds, reloading timer",
> >>>> interval
> >>>> );
> >>>> - timer = Self::setup_timer(interval);
> >>>> + (timer, next_run) = Self::setup_timer(interval);
> >>>> + // If change (and therefore reset) our timer right before it fires,
> >>>> + // we could potentially miss one collection event.
> >>>
> >>> Couldn't we instead just pass `next_run` through to `setup_timer` and
> >>> call `reset_at(next_run)` on it? (`first_run` would only be used in the
> >>> initial setup, so `next_run` could either be an `Option`, or the setup
> >>> code does the `next_aligned_instant` call...
> >>>
> >>> This should be much less code by making the new
> >>> `fetch_overdue{,_and_save_sate}()` functions unnecessary, or am I
> >>> missing something?
> >>>
> >>
> >> I guess the question is, do we want nicely aligned timer ticks?
> >>
> >> e.g. 14:01:00, 14:02:00, 14:03:00 ... for 60 second interval
> >> or 14:00:00, 14:05:00, 14:10:00 ... for a 5 minute interval?
> >>
> >> Because that was the main intention behind using the 'collection-interval' as
> >> a base for calculating the aligned instant for the first timer reset.
> >> If we reuse the 'old' `next_run` when the interval is changed, we
> >> also reuse the old alignment.
> >>
> >> For instance, when changing from initially 1 minute to 5 minutes, the
> >> timer ticks might come at
> >> 14:01:00, 14:06:00, 14:11:00
> >>
> >> Technically, the naming for the `next_run` variable is not the best,
> >> since it just contains the Instant when the timer *first* fires, but
> >> this is then never updated to the *next* time the timer will fire...
> >> So that means that when changing the interval with your suggested change,
> >> you'd pass an Instant to `reset_at` that is already in the past,
> >> causing the timer to fire immediately.
> >>
> >> If we *don't* care about the aligned ticks as described above, we could
> >> just use a static alignment boundary, e.g. 60 seconds.
> >> In this case we can also get rid of the fetch_overdue stuff, since
> >> at worst case we have 60 seconds until the next tick on startup or timer change,
> >> which should be good enough to prevent any significant gaps in the data.
> >
> > What about setting a flag - if the current next tick was earlier than
> > the new next tick - to tell tick() to re-align the timer when it is next
> > triggered?
>
> The problem is that tokio::time::Interval doesn't give you a way to query when
> the next expected tick will be. We can only approximate it by recalculating
> a new aligned instant with the same interval, but I guess this might behave
> unpredictably in edge cases.
>
> >
> > So when going from 1 to 5 minutes at 14:01:50, we `.reset_at(14:02:00)`
> > and also set `realign = true`, and at 14:02, tick() should
> > `.reset_at(14:05:00)`.
> >
> > I just feel like the logic in the "fetch_overdue" code should not be
> > necessary to have, but if it's too awkward to handle via the tick timer,
> > it's fine to keep it in v2.
>
> fetch_overdue is also called when the daemon starts up. If we align to the collection interval
> and the daemon was down for a while, we otherwise might end up with gaps in the data.
>
> Remotes keep metric history for 30 minutes. If PDM is down for, say, 29 minutes and
> we are aligning to 15min boundaries, in the worst case we might have to wait for
> another 15min to fetch metrics, resulting in a gap.
>
> Of course we could just unconditionally force collection after startup, but I think the
> fetch_overdue solution solves this and the timer change issue quite okayish.
>
> In v2 I got rid of the fetch_overdue_and_save_state wrapper by putting the state.save()
> that we already had in the main loop at the end of the loop, so it's a bit less code now.
> The remaining code is not really that complex, I think I'd prefer to keep it
> for now.
Okay.
More information about the pdm-devel
mailing list