[pdm-devel] [PATCH proxmox-datacenter-manager v4 0/6] remote task cache fetching task / better cache backend
Lukas Wagner
l.wagner at proxmox.com
Fri Apr 18 10:32:04 CEST 2025
The aim of this patch series is to greatly improve the performance of the
remote task cache for big PDM setups.
The initial, 'dumb' cache implementation had the following problems:
1.) cache was populated as part of the `get_tasks` API, leading to hanging
API calls while fetching task data from remotes
2.) all tasks were stored in a single file, which was completely rewritten
for any change to the cache's contents
3.) The caching mechanism was pretty simple, using only a max-age mechanism,
re-requesting all task data if max-age was exceeded
Now, these characteristics are not really problematic for *small* PDM setups
with only a couple of remotes. However, for big setups (e.g. 100 remotes, each
remote being a PVE cluster with 10 nodes), this completely falls apart:
1.) fetching remote tasks takes a considerable amount of time, especially
on connections with a high latency. Since the data is requested
from *within* the `get_tasks` function, which is called by the
`remote-tasks/list` API handler, the API call is blocked until
*all* task data is requested.
2.) The single file approach leads to significant writes to the disk
3.) The max-age mechanism leads to unnecessary network IO, as we
re-request data that we already have locally.
To rectify the situation, this series performs the following changes:
- `get_tasks` never does any fetching, it only reads the most recent
data from the cache
- There is a new background task which periodically fetches tasks
from all remotes (every 10mins at the moment). Only the latest
missing tasks are requested, not the full task history as before
- The new background task also takes over the 'tracked task' polling
duty, where we fetch the status for any task started by PDM on
a remote (short polling interval, 10s at the moment).
- The task cache storage implementation has been completely overhauled
and is now optimized for the most common accesses to the cache.
It is also more storage efficient, occupying roughly 50% of the disk
space for the same number of tasks (achieved by avoiding duplicate
information in the files)
- The size of the task cache is 'limited' by doing file rotation.
We keep 7 days of task history.
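To illustrate the "only the latest missing tasks are requested" point from the list above, here is a minimal sketch of per-remote cutoff tracking. It is not the code from this series: `Task`, `FetchState` and `fake_remote_tasks` are hypothetical names, and the assumption that a remote can filter by a `since` timestamp is made for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified task record; the real PDM type differs.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    upid: String,
    /// Start time in Unix epoch seconds.
    starttime: i64,
}

/// Per-remote bookkeeping: newest start time that is already cached.
#[derive(Default)]
struct FetchState {
    cutoff: HashMap<String, i64>,
}

impl FetchState {
    /// Lower bound to send with the next request for `remote`
    /// (0 if nothing has been fetched yet).
    fn since(&self, remote: &str) -> i64 {
        self.cutoff.get(remote).copied().unwrap_or(0)
    }

    /// Advance the cutoff after the returned tasks were stored in the cache.
    fn record(&mut self, remote: &str, tasks: &[Task]) {
        if let Some(newest) = tasks.iter().map(|t| t.starttime).max() {
            let entry = self.cutoff.entry(remote.to_string()).or_insert(0);
            *entry = (*entry).max(newest);
        }
    }
}

/// Stand-in for a remote API call that honors a `since` filter (assumption).
fn fake_remote_tasks(all: &[Task], since: i64) -> Vec<Task> {
    all.iter().filter(|t| t.starttime > since).cloned().collect()
}
```

With this bookkeeping, a second fetch round right after the first one returns nothing instead of the full history. The strict `>` comparison would miss tasks that share the cutoff's start time, so a real implementation would need extra care there, e.g. de-duplicating by UPID.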
For details on *how* the cache itself works, please refer to the full
commit message of
remote tasks: implement improved cache for remote tasks
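As a rough illustration of limiting the cache size through rotation with a 7-day window, the following sketch splits archive files into kept and expired ones. The commit message remains the authoritative description; `ArchiveFile`, its fields and `rotate` are hypothetical and only assume that each archive file knows the start time of its newest task.

```rust
/// Retention window: 7 days, in seconds.
const MAX_AGE_SECS: i64 = 7 * 24 * 60 * 60;

/// Hypothetical descriptor of one archive file.
#[derive(Debug, Clone, PartialEq)]
struct ArchiveFile {
    name: String,
    /// Start time (Unix epoch seconds) of the newest task in this file.
    newest_starttime: i64,
}

/// Split `files` into (kept, expired) relative to `now`.
/// A file expires once even its newest task is older than the window.
fn rotate(files: Vec<ArchiveFile>, now: i64) -> (Vec<ArchiveFile>, Vec<ArchiveFile>) {
    files
        .into_iter()
        .partition(|f| now - f.newest_starttime <= MAX_AGE_SECS)
}
```

Dropping whole files keeps the cleanup cheap: no file has to be rewritten just to remove old entries.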
# Benchmarks
Finally, some concrete data to back up the claimed performance improvements. The
times were measured *inside* the `get_tasks` function and not at the API level,
so the times do not include JSON serialization and data transfer.
Benchmarking was done using the 'fake-remote' feature. There were 100 remotes,
10 PVE nodes per remote. The task cache contained about 1.5 million tasks.
                                             before    after
list of active tasks (*):                    ~1.3s     ~300µs
list of 500 tasks, offset 0 (**):            ~1.3s     ~1450µs
list of 500 tasks, offset 1 million (***):   ~1.3s     ~175ms
Size on disk:                                ~500MB    ~200MB
(*): Requested by the UI every 3s
(**): Requested by the UI when visiting Remotes > Tasks
(***): E.g. when scrolling towards the bottom of 'Remotes > Tasks'
In the old implementation, the archive file was *always* fully deserialized and
loaded into RAM; this is the reason why the time needed is pretty much identical
for all scenarios.
The new implementation reads the archive files only line by line, and only 500
tasks were loaded into RAM at the same time. The higher the offset, the more
archive lines/files we have to scan, which increases the time needed to access
the data. The tasks are sorted descending by starttime, as a result the
requests get slower the further you go back in history.
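The offset-dependent cost described above can be sketched with lazy iterators. This is a simplified stand-in, not the actual archive iterator: `page` is a hypothetical helper that assumes one task per line and files ordered from newest to oldest.

```rust
use std::io::BufRead;

/// Return at most `limit` lines after skipping `offset` lines, reading the
/// given archive files in order (newest file first) and line by line.
fn page<R: BufRead>(
    files: Vec<R>,
    offset: usize,
    limit: usize,
) -> std::io::Result<Vec<String>> {
    files
        .into_iter()
        // Chain the line iterators of all files; nothing is read yet.
        .flat_map(|f| f.lines())
        // Skipped lines still have to be read, so cost grows with `offset`.
        .skip(offset)
        // Stop as soon as one page is complete; later lines are never read.
        .take(limit)
        .collect()
}
```

Only one page of lines is held in memory at a time, but every skipped line still has to be scanned, which matches the slower access at large offsets.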
The 'before' times do NOT include the time needed for actually fetching the
task data.
This series was preceded by [1]; however, almost all of the code has changed,
which is the reason why I am sending this as a new series.
[1] https://lore.proxmox.com/pdm-devel/20250128122520.167796-1-l.wagner@proxmox.com/
Changes since v3:
- Include benchmark results in commit message
- Remove unneeded and potentially unsafe `pub` (thx Wolfgang)
Changes since v2:
- Change locking approach as suggested by Wolfgang
- Incorporated feedback from Wolfgang
- see patch notes for details
- Added some .context/.with_context for better error messages
Changes since v1:
- Drop already applied patches
- Some code style improvements, see individual patch changelogs
- Move task fetching task to bin/proxmox-datacenter-api/tasks/remote_task.rs
- Make sure that remote_tasks::get_tasks does not block the async executor
proxmox-datacenter-manager:
Lukas Wagner (6):
remote tasks: implement improved cache for remote tasks
remote tasks: add background task for task polling, use new task cache
remote tasks: improve locking for task archive iterator
pdm-api-types: remote tasks: add new_from_str constructor for
TaskStateType
fake remote: make the fake_remote feature compile again
fake remote: clippy fixes
lib/pdm-api-types/src/lib.rs | 15 +
server/src/api/pve/lxc.rs | 10 +-
server/src/api/pve/mod.rs | 4 +-
server/src/api/pve/qemu.rs | 6 +-
server/src/api/remote_tasks.rs | 11 +-
server/src/bin/proxmox-datacenter-api/main.rs | 1 +
.../bin/proxmox-datacenter-api/tasks/mod.rs | 1 +
.../tasks/remote_tasks.rs | 364 ++++++
server/src/remote_tasks/mod.rs | 612 ++--------
server/src/remote_tasks/task_cache.rs | 1020 +++++++++++++++++
server/src/test_support/fake_remote.rs | 35 +-
11 files changed, 1549 insertions(+), 530 deletions(-)
create mode 100644 server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
create mode 100644 server/src/remote_tasks/task_cache.rs
Summary over all repositories:
11 files changed, 1549 insertions(+), 530 deletions(-)
--
Generated by git-murpp 0.8.1