[pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
Daniel Kral
d.kral at proxmox.com
Thu Oct 16 17:15:48 CEST 2025
On Thu Oct 16, 2025 at 1:12 PM CEST, Fiona Ebner wrote:
> Am 30.09.25 um 4:21 PM schrieb Daniel Kral:
>> As the HA Manager builds the static load scheduler, it queries the
>> services' static usage by reading and parsing the static guest configs
>> individually, which can take significant time with respect to how many
>> times recompute_online_node_usage(...) is called.
>>
>> PVE::Cluster exposes an efficient interface to gather a set of
>> properties from one or all guest configs [0]. This is used here to build
>> a rather short-lived cache on every (re)initialization of the static
>> load scheduler.
>>
>> The downside to this approach is if there are way more non-HA managed
>> guests in the cluster than HA-managed guests, which causes this to load
>> much more information than necessary. It also doesn't cache the default
>> values for the environment-specific static service usage, which causes
>> quite a bottleneck as well.
>>
>> [0] pve-cluster cf1b19d (add get_guest_config_property IPCC method)
>>
>> Suggested-by: Fiona Ebner <f.ebner at proxmox.com>
>> Signed-off-by: Daniel Kral <d.kral at proxmox.com>
>> ---
>> If the above mentioned downside is too large for some setups, we could
>> also just add a switch to enable/disable the static stats cache or
>> automatically figure out if it brings any benefits.. But I think this
>> will be good enough, especially with the latter patches making way fewer
>> calls to get_service_usage(...).
>
> Parsing the configs is much more costly than
> PVE::Cluster::get_guest_config_properties(), so improving the situation
> for the majority of use cases of the static scheduler still makes sense,
> even if it hurts performance for some more exotic setups. And we would
> like to consider usage of non-HA guests in the future too to make the
> static scheduler more accurate even in such exotic setups. But with
> PSI-based information that is already included in the node usage, no
> extra preparation necessary there.
I fully agree, that's a rather odd case and in hindsight I doubt that
this has any significant performance improvements for the most common
setups.
Not that important, but I'd move this patch after the granular changes
are implemented since only there this patch makes a performance
improvement while at this point in time and with this implementation it
actually degrades performance.
>
>>
>> src/PVE/HA/Env.pm | 12 ++++++++++++
>> src/PVE/HA/Env/PVE2.pm | 21 +++++++++++++++++++++
>> src/PVE/HA/Manager.pm | 1 +
>
> Needs a rebase because of changes in Manager.pm
>
>> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
>> };
>> }
>>
>> +sub get_static_service_stats {
>> + my ($self, $id) = @_;
>> +
>> + # undef if update_static_service_stats(...) failed before
>> + return undef if !defined($self->{static_service_stats});
>> +
>> + return $self->{static_service_stats}->{$id} // {};
>
> Can't returning '{}' when nothing is there lead to issues down the line?
> If we return undef instead, it's consistent with not having anything
> cached and the caller will fall back to loading the config.
This return value type definitely needs improvement and/or better
documentation, but an undef $self->{static_service_stats}->{$id} value
indicates that it should fallback to the default value as none of the
properties requested by get_guest_config_properties(...) was included in
that particular guest config, e.g. no 'cores', 'sockets', and 'memory'
properties defined in a VM config.
When $self->{static_service_stats} itself is undef, then the static
cache couldn't be queried for some reason.
>
>> @@ -477,6 +467,8 @@ sub new {
>>
>> $self->{service_config} = $self->read_service_config();
>>
>> + $self->{static_service_stats} = undef;
>> +
>> return $self;
>> }
>>
>> @@ -943,6 +935,25 @@ sub watchdog_update {
>> return &$modify_watchog($self, $code);
>> }
>>
>> +sub get_static_service_stats {
>> + my ($self, $id) = @_;
>> +
>> + # undef if update_static_service_stats(...) failed before
>> + return undef if !defined($self->{static_service_stats});
>> +
>> + return $self->{static_service_stats}->{$id} // {};
>
> Same here.
>
>> +}
>> +
>> +sub update_static_service_stats {
>> + my ($self) = @_;
>> +
>> + my $filename = "$self->{statusdir}/static_service_stats";
>> + my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
>> + $self->log('warning', "unable to update static service stats cache - $@") if $@;
>> +
>> + $self->{static_service_stats} = $stats;
>> +}
>> +
>> sub get_static_node_stats {
>> my ($self) = @_;
>>
>> diff --git a/src/PVE/HA/Sim/Resources.pm b/src/PVE/HA/Sim/Resources.pm
>> index 72623ee1..7641b1a9 100644
>> --- a/src/PVE/HA/Sim/Resources.pm
>> +++ b/src/PVE/HA/Sim/Resources.pm
>> @@ -143,8 +143,8 @@ sub get_static_stats {
>> my $sid = $class->type() . ":$id";
>> my $hardware = $haenv->hardware();
>>
>> - my $stats = $hardware->read_static_service_stats();
>> - if (my $service_stats = $stats->{$sid}) {
>> + my $service_stats = $hardware->get_static_service_stats($sid);
>> + if (%$service_stats) {
>
> Style nit: with the above change, this could be just $service_stats or a
> definedness check for it. If really checking if there are any keys, I
> prefer being explicit with scalar(keys ...)
ACK will do that, don't know why I didn't see that already :)
>
>> return $service_stats;
>> } elsif ($id =~ /^(\d)(\d\d)/) {
>> # auto assign usage calculated from ID for convenience
More information about the pve-devel
mailing list