[pve-devel] [PATCH manager] pvestatd: improve broadcast of node version-info

Thu Feb 27 09:59:28 CET 2025

On 26.02.25 at 17:02, Aaron Lauterer wrote:
> 
> 
> On 2025-01-17 13:18, Fiona Ebner wrote:
>> On 16.01.25 at 17:30, Aaron Lauterer wrote:
>>> Until now, pvestatd broadcast the pve-manager version only once
>>> after startup of the service. But there are some situations where the
>>> local pmxcfs (pve-cluster) restarts and loses that information,
>>> basically every time we restart the pmxcfs without restarting pvestatd
>>> too.
>>>
>>> For example, on a cluster join, or if the pmxcfs has been restarted
>>> manually.
>>>
>>> By additionally checking if the local kv-store of the pmxcfs has any
>>> version info for the node, we can decide if another broadcast is
>>> necessary.
>>> Therefore, after the next run of pvestatd, we should have the full
>>> version info available again.
>>>
>>> Signed-off-by: Aaron Lauterer <a.lauterer at proxmox.com>
>>> ---
>>> This patch is preparation to get reliable version info, as I am picking
>>> up the patch series of Folke to include more metrics into the RRD data
>>> and summary graphs. [0]
>>> This was a big blocker and now with the major version change coming up,
>>> we at least can assume the latest 8.x installed as part of the update to
>>> PVE 9.
>>> Therefore, we should get this in with PVE 8. Additional patches for PVE
>>> 8 will follow to make the transition smoother. But as mentioned, this
>>> here is one of the things that needs to work reliably, which is why I
>>> am submitting the patch now already.
>>
>> If we start relying more on this, we likely also want:
>> https://lore.proxmox.com/pve-devel/20221006125414.58279-1-f.ebner@proxmox.com/
> 
> Hmm, honestly, I might prefer having the last known version info still
> present. That would make it easier to determine if all cluster nodes are
> on at least a required version ;).

That is an edge case where it might be useful, but I'd argue that in
general, it can be problematic to rely on stale information, especially
if you can't detect if it's stale or not. And IMHO, it's worth doing
properly here too, i.e. wait for the node to send its current version.
You already need to wait for nodes that were not online before.

> 
> But I think it would be better, with RRD data migration in mind, to make
> it mandatory that all cluster nodes are online before one can proceed
> instead of relying on stale version info.
> 
>>
>>>
>>> [0] https://lore.proxmox.com/pve-devel/20231211144721.212071-1-f.gleumes@proxmox.com/
>>>
>>>   PVE/Service/pvestatd.pm | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/PVE/Service/pvestatd.pm b/PVE/Service/pvestatd.pm
>>> index 7fa003fe..03c578e1 100755
>>> --- a/PVE/Service/pvestatd.pm
>>> +++ b/PVE/Service/pvestatd.pm
>>> @@ -527,7 +527,10 @@ sub update_sdn_status {
>>>     my $broadcast_version_info_done = 0;
>>>   my sub broadcast_version_info : prototype() {
>>> -    if (!$broadcast_version_info_done) {
>>> +    if (
>>> +    !$broadcast_version_info_done
>>> +    || !keys PVE::Cluster::get_node_kv('version-info', $nodename)->%*
>>
>> Style nit: IMHO, it would be easier to read if surrounded by an explicit
>> scalar()
> 
> You mean to have it like this?
> | !scaler(keys PVE::Cluster::get_node_kv('version-info', $nodename)->%*)

Yes (except for the typo ;P)
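
For illustration, a minimal self-contained sketch of the check with the
explicit scalar(). Note that get_node_kv_stub() and %fake_kv are
hypothetical stand-ins for PVE::Cluster::get_node_kv() and the pmxcfs
kv-store, assumed here to return a hash ref that is empty when nothing is
stored for the node. This is not the actual pvestatd code, and it needs
Perl 5.24 or newer for the postfix dereference:

```perl
use strict;
use warnings;

# Hypothetical in-memory replacement for the pmxcfs kv-store.
my %fake_kv = ('version-info' => {});

# Hypothetical stand-in for PVE::Cluster::get_node_kv(): returns a hash ref
# with an entry for the node, or an empty hash ref if nothing is stored.
sub get_node_kv_stub {
    my ($key, $nodename) = @_;
    my $value = $fake_kv{$key}->{$nodename};
    return defined($value) ? { $nodename => $value } : {};
}

my $broadcast_version_info_done = 0;
my $broadcast_count = 0;

# Broadcast if we never did so, or if the kv-store lost our entry.
# Returns the number of broadcasts performed so far.
sub broadcast_version_info_sketch {
    my ($nodename) = @_;
    if (
        !$broadcast_version_info_done
        || !scalar(keys get_node_kv_stub('version-info', $nodename)->%*)
    ) {
        # Placeholder payload; the real daemon would send actual version data.
        $fake_kv{'version-info'}->{$nodename} = 'placeholder-version-info';
        $broadcast_count++;
        $broadcast_version_info_done = 1;
    }
    return $broadcast_count;
}
```

The scalar() does not change behavior, since "!" already imposes scalar
context on keys; it only makes explicit that the number of keys is what
gets negated.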