[pbs-devel] [PATCH proxmox-backup 0/3] datastore: remove config reload on hot path
Fabian Grünbichler
f.gruenbichler at proxmox.com
Wed Nov 12 12:27:17 CET 2025
On November 11, 2025 1:29 pm, Samuel Rufinatscha wrote:
> Hi,
>
> this series reduces CPU time in datastore lookups by avoiding repeated
> datastore.cfg reads/parses in both `lookup_datastore()` and
> `DataStore::Drop`. It also adds a TTL so manual config edits are
> noticed without reintroducing hashing on every request.
>
> While investigating #6049 [1], cargo-flamegraph [2] showed hotspots during
> repeated `/status` calls in `lookup_datastore()` and in `Drop`,
> dominated by `pbs_config::datastore::config()` (config parse).
>
> The parsing cost itself should eventually be investigated in a future
> effort. Furthermore, cargo-flamegraph showed that when using a
> token-based auth method to access the API, a significant amount of time
> is spent in validation on every request, likely related to bcrypt.
> This should also be revisited in a future effort.

please file a bug for the token part, if there isn't already one!

thanks for diving into this, it already looks promising, even though the
effect on more "normal" systems with more reasonable numbers of
datastores and clients will be less pronounced ;)

the big TL;DR would be that we trade faster datastore lookups (which
happen quite frequently, in particular if there are many datastores with
clients checking their status) against slightly delayed reload of the
configuration in case of manual, behind-our-backs edits, with one
particular corner case that is slightly problematic, but also a bit
contrived:

- datastore is looked up
- config is edited (manually) to set maintenance mode to one that
  requires removing the datastore from the map once the last task exits
- last task drops the datastore struct
- no regular edits happened in the meantime
- the Drop handler doesn't know it needs to remove the datastore from
  the map
- open FD is held by proxy, datastore fails to be unmounted/..
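
the sequence above can be modelled with a small, self-contained sketch. this is not the code from the series: `Maintenance`, `TtlCache` and `evict_on_last_drop` are hypothetical names, the on-disk config is reduced to a single value passed in by the caller, and it assumes the drop path consults the same TTL-bound cache as lookups do.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, reduced stand-in for a datastore's maintenance mode.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Maintenance {
    None,
    /// Assumed to require eviction from the map once the last task exits.
    Offline,
}

/// Hypothetical TTL cache holding the last parsed config value.
struct TtlCache {
    ttl: Duration,
    entry: Option<(Maintenance, Instant)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        TtlCache { ttl, entry: None }
    }

    /// Return the cached value while it is younger than the TTL,
    /// otherwise "re-read" `on_disk` and cache that instead.
    fn lookup(&mut self, on_disk: Maintenance, now: Instant) -> Maintenance {
        if let Some((mode, loaded_at)) = self.entry {
            if now.saturating_duration_since(loaded_at) < self.ttl {
                return mode;
            }
        }
        self.entry = Some((on_disk, now));
        on_disk
    }
}

/// Decision taken when the last task drops its datastore reference.
/// With a fresh cache entry, a manual edit on disk is not seen here.
fn evict_on_last_drop(cache: &mut TtlCache, on_disk: Maintenance, now: Instant) -> bool {
    cache.lookup(on_disk, now) == Maintenance::Offline
}
```

in this model, a drop that happens inside the TTL after a manual edit returns `false` and the datastore stays in the map. the same drop after the TTL has expired re-reads the value and returns `true`.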

we could solve this issue by not only bumping the generation on save,
but also when we reload the config (in particular if we cache the whole
config!). that would make the Drop handler efficient enough for idle
systems that have mostly lookups but no long running tasks. as soon as a
datastore has long running tasks, the last such task will likely exit
long after the TTL for its config lookup has expired, so it will need to
do a refresh - although that refresh could again be from the global
cache, instead of from disk? that still wouldn't close the window
entirely, but would make it pretty unlikely to be hit in practice..
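
a minimal sketch of the "bump the generation on reload as well" idea, under stated assumptions: `GenCache` and `Handle` are hypothetical types, the config is kept as an opaque string (the strings used with it are illustrative, not real datastore.cfg syntax), and the generation is only bumped on reload if the content actually changed. that last point is a design choice of this sketch, not something the series specifies.

```rust
/// Hypothetical global config cache with a generation counter.
struct GenCache {
    generation: u64,
    raw: String,
}

impl GenCache {
    fn new(raw: &str) -> Self {
        GenCache { generation: 0, raw: raw.to_string() }
    }

    /// Regular edit through the API: always bumps the generation.
    fn save(&mut self, raw: &str) {
        self.raw = raw.to_string();
        self.generation += 1;
    }

    /// TTL-triggered reload from disk. Assumption: bump only when the
    /// content differs, so unchanged reloads do not invalidate handles.
    /// Returns true if a manual edit was picked up.
    fn reload(&mut self, on_disk: &str) -> bool {
        if self.raw == on_disk {
            return false;
        }
        self.raw = on_disk.to_string();
        self.generation += 1;
        true
    }
}

/// Hypothetical datastore handle remembering the generation it was
/// looked up at.
struct Handle {
    seen_generation: u64,
}

impl Handle {
    fn lookup(cache: &GenCache) -> Self {
        Handle { seen_generation: cache.generation }
    }

    /// For use in a Drop handler: the config only needs to be
    /// re-checked if the generation moved since the lookup.
    fn needs_recheck(&self, cache: &GenCache) -> bool {
        self.seen_generation != cache.generation
    }
}
```

with this shape, the drop path on an idle system is a single integer comparison, and a manual edit becomes visible to it as soon as any lookup has triggered a reload after the TTL expired.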