[pve-devel] [PATCH cluster v4 1/2] status: introduce new pve-{type}- rrd and metric format

Wed Jul 30 13:21:19 CEST 2025

On Tue Jul 29, 2025 at 11:44 AM CEST, Lukas Wagner wrote:
> Hey Aaron,
>
> some notes inline.
>
> The memory leaks I mentioned should be fixed before this goes in (since
> this is in the long-running pmxcfs process), everything else could also
> happen as a followup. I'm not super familiar with this code, so there
> might be things that I've missed.
>
> On Sat Jul 26, 2025 at 3:05 AM CEST, Aaron Lauterer wrote:
>> With PVE9 now we have additional fields in the metrics that are
>> collected and distributed in the cluster. The new fields/columns are
>> added at the end of the existing ones. This makes it possible for PVE8
>> installations to still use them by cutting the new additional data.
>>
>> To make it more future proof, the format of the keys for each metrics
>> are now changed:
>>
>> Old pre PVE9:  pve{version}-{type}/{id}
>> Now with PVE9: pve-{type}-{version}/{id}
>>
>> This way we have an easier time to handle new versions in the future as
>> we initially only need to check for `pve-{type}-`. If we know the
>> version, we can handle it accordingly; e.g. pad if older format with
>> missing data. If we don't know the version, it must be a newer one and
>> we cut the data stream at the length we need for the current version.
>>
>> This means of course that to avoid a breaking change, we can only add
>> new columns if needed, but not remove any! But waiting for a breaking
>> change until the next major release is a worthy trade-off if it allows
>> us to expand the format in between if needed.
>>
>> The 'rrd_skip_data' function got a new parameter defining the sepataring
>> character. This then makes it possible to use it also to determine which
>> part of the key string is the version/type and which one is the actual
>> resource identifier.
>>
>> We add several new columns to nodes and VMs (guest) RRDs. See futher
>> down for details. Additionally we change the RRA definitions on how we
>> aggregate the data to match how we do it for the Proxmox Backup Server
>> [0].
>>
>> The migration of an existing installation is handled by a dedicated
>> tool. Only once that has happened, will we store data in the new
>> format.
>> This leaves us with a few cases to handle:
>>
>>   data recv →          old                                 new
>>   ↓ rrd files
>>  -------------|---------------------------|-------------------------------------
>>   none        | check if directories exists:
>>               |     neither old or new -> new
>> 	      |     new                -> new
>> 	      |     old only           -> old
>> --------------|---------------------------|-------------------------------------
>>   only old    | use old file as is        | cut new columns and use old file
>> --------------|---------------------------|-------------------------------------
>>   new present | pad data to match new fmt | use new file as is and pass data
>>
>> To handle the padding we use a buffer. Cutting can be handled as we
>> already do it in the stable/bookworm (PVE8) branch by introducing a null
>> terminator in the original string at the end of the expected columns.
>>
>> We add the following new columns:
>>
>> Nodes:
>> * memfree
>> * arcsize
>> * pressures:
>>   * cpu some
>>   * io some
>>   * io full
>>   * mem some
>>   * mem full
>>
>> VMs:
>> * memhost (memory consumption of all processes in the guests cgroup, host view)
>> * pressures:
>>   * cpu some
>>   * cpu full
>>   * io some
>>   * io full
>>   * mem some
>>   * mem full
>>
>> [0] https://git.proxmox.com/?p=proxmox-backup.git;a=blob;f=src/server/metric_collection/rrd.rs;h=ed39cc94ee056924b7adbc21b84c0209478bcf42;hb=dc324716a688a67d700fa133725740ac5d3795ce#l76
>>
>> Signed-off-by: Aaron Lauterer <a.lauterer at proxmox.com>
>> ---
>>  src/pmxcfs/status.c | 261 +++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 236 insertions(+), 25 deletions(-)
>>
>> diff --git a/src/pmxcfs/status.c b/src/pmxcfs/status.c
>> index e6b578b..eaef12c 100644
>> --- a/src/pmxcfs/status.c
>> +++ b/src/pmxcfs/status.c
>> @@ -1096,6 +1096,9 @@ kventry_hash_set(GHashTable *kvhash, const char *key, gconstpointer data, size_t
>>      return TRUE;
>>  }
>>  
>> +// We create the RRD files with a 60 second stepsize, therefore, RRA timesteps
>> +// are alwys per 60 seconds. These 60 seconds are usually showing up in other
>> +// code paths where we interact with RRD data!
>>  static const char *rrd_def_node[] = {
>>      "DS:loadavg:GAUGE:120:0:U",
>>      "DS:maxcpu:GAUGE:120:0:U",
>> @@ -1124,6 +1127,39 @@ static const char *rrd_def_node[] = {
>>      NULL,
>>  };
>>  
>> +static const char *rrd_def_node_pve9_0[] = {
>> +    "DS:loadavg:GAUGE:120:0:U",
>> +    "DS:maxcpu:GAUGE:120:0:U",
>> +    "DS:cpu:GAUGE:120:0:U",
>> +    "DS:iowait:GAUGE:120:0:U",
>> +    "DS:memtotal:GAUGE:120:0:U",
>> +    "DS:memused:GAUGE:120:0:U",
>> +    "DS:swaptotal:GAUGE:120:0:U",
>> +    "DS:swapused:GAUGE:120:0:U",
>> +    "DS:roottotal:GAUGE:120:0:U",
>> +    "DS:rootused:GAUGE:120:0:U",
>> +    "DS:netin:DERIVE:120:0:U",
>> +    "DS:netout:DERIVE:120:0:U",
>> +    "DS:memfree:GAUGE:120:0:U",
>> +    "DS:arcsize:GAUGE:120:0:U",
>> +    "DS:pressurecpusome:GAUGE:120:0:U",
>> +    "DS:pressureiosome:GAUGE:120:0:U",
>> +    "DS:pressureiofull:GAUGE:120:0:U",
>> +    "DS:pressurememorysome:GAUGE:120:0:U",
>> +    "DS:pressurememoryfull:GAUGE:120:0:U",
>> +
>> +    "RRA:AVERAGE:0.5:1:1440",    // 1 min * 1440 => 1 day
>> +    "RRA:AVERAGE:0.5:30:1440",   // 30 min * 1440 => 30 day
>> +    "RRA:AVERAGE:0.5:360:1440",  // 6 hours * 1440 => 360 day ~1 year
>> +    "RRA:AVERAGE:0.5:10080:570", // 1 week * 570 => ~10 years
>> +
>> +    "RRA:MAX:0.5:1:1440",    // 1 min * 1440 => 1 day
>> +    "RRA:MAX:0.5:30:1440",   // 30 min * 1440 => 30 day
>> +    "RRA:MAX:0.5:360:1440",  // 6 hours * 1440 => 360 day ~1 year
>> +    "RRA:MAX:0.5:10080:570", // 1 week * 570 => ~10 years
>> +    NULL,
>> +};
>> +
>>  static const char *rrd_def_vm[] = {
>>      "DS:maxcpu:GAUGE:120:0:U",
>>      "DS:cpu:GAUGE:120:0:U",
>> @@ -1149,6 +1185,36 @@ static const char *rrd_def_vm[] = {
>>      "RRA:MAX:0.5:10080:70", // 7 day max - ony year
>>      NULL,
>>  };
>> +static const char *rrd_def_vm_pve9_0[] = {
>> +    "DS:maxcpu:GAUGE:120:0:U",
>> +    "DS:cpu:GAUGE:120:0:U",
>> +    "DS:maxmem:GAUGE:120:0:U",
>> +    "DS:mem:GAUGE:120:0:U",
>> +    "DS:maxdisk:GAUGE:120:0:U",
>> +    "DS:disk:GAUGE:120:0:U",
>> +    "DS:netin:DERIVE:120:0:U",
>> +    "DS:netout:DERIVE:120:0:U",
>> +    "DS:diskread:DERIVE:120:0:U",
>> +    "DS:diskwrite:DERIVE:120:0:U",
>> +    "DS:memhost:GAUGE:120:0:U",
>> +    "DS:pressurecpusome:GAUGE:120:0:U",
>> +    "DS:pressurecpufull:GAUGE:120:0:U",
>> +    "DS:pressureiosome:GAUGE:120:0:U",
>> +    "DS:pressureiofull:GAUGE:120:0:U",
>> +    "DS:pressurememorysome:GAUGE:120:0:U",
>> +    "DS:pressurememoryfull:GAUGE:120:0:U",
>> +
>> +    "RRA:AVERAGE:0.5:1:1440",    // 1 min * 1440 => 1 day
>> +    "RRA:AVERAGE:0.5:30:1440",   // 30 min * 1440 => 30 day
>> +    "RRA:AVERAGE:0.5:360:1440",  // 6 hours * 1440 => 360 day ~1 year
>> +    "RRA:AVERAGE:0.5:10080:570", // 1 week * 570 => ~10 years
>> +
>> +    "RRA:MAX:0.5:1:1440",    // 1 min * 1440 => 1 day
>> +    "RRA:MAX:0.5:30:1440",   // 30 min * 1440 => 30 day
>> +    "RRA:MAX:0.5:360:1440",  // 6 hours * 1440 => 360 day ~1 year
>> +    "RRA:MAX:0.5:10080:570", // 1 week * 570 => ~10 years
>> +    NULL,
>> +};
>>  
>>  static const char *rrd_def_storage[] = {
>>      "DS:total:GAUGE:120:0:U",
>> @@ -1168,8 +1234,30 @@ static const char *rrd_def_storage[] = {
>>      NULL,
>>  };
>>  
>> +static const char *rrd_def_storage_pve9_0[] = {
>> +    "DS:total:GAUGE:120:0:U",
>> +    "DS:used:GAUGE:120:0:U",
>> +
>> +    "RRA:AVERAGE:0.5:1:1440",    // 1 min * 1440 => 1 day
>> +    "RRA:AVERAGE:0.5:30:1440",   // 30 min * 1440 => 30 day
>> +    "RRA:AVERAGE:0.5:360:1440",  // 6 hours * 1440 => 360 day ~1 year
>> +    "RRA:AVERAGE:0.5:10080:570", // 1 week * 570 => ~10 years
>> +
>> +    "RRA:MAX:0.5:1:1440",    // 1 min * 1440 => 1 day
>> +    "RRA:MAX:0.5:30:1440",   // 30 min * 1440 => 30 day
>> +    "RRA:MAX:0.5:360:1440",  // 6 hours * 1440 => 360 day ~1 year
>> +    "RRA:MAX:0.5:10080:570", // 1 week * 570 => ~10 years
>> +    NULL,
>> +};
>
> Not important right now, but the migration tool and this
> should probably use the same source of the schema definition (e.g. via
> putting this into a header file which is then shared in some way
> (submodule?))
>
>> +
>>  #define RRDDIR "/var/lib/rrdcached/db"
>>  
>> +// A 4k buffer should be plenty to temporarily store RRD data. 64 bit integers are 20 chars long,
>> +// plus the separator char: (4096-1)/21~195 columns This buffer is only used in the
>> +// `update_rrd_data` function. It is safe to use as the calling sites get the global mutex:
>> +// rrd_update_data -> rrdentry_hash_set -> cfs_status_set / and cfs_kvstore_node_set
>> +static char rrd_format_update_buffer[4096];
>
> Maybe I'm missing something, but since this is only used in the
> update_rrd_data function, is there any reason why this statically
> allocated and not just malloc'd as needed?
>
>> +
>>  static void create_rrd_file(const char *filename, int argcount, const char *rrddef[]) {
>>      /* start at day boundary */
>>      time_t ctime;
>> @@ -1229,6 +1317,8 @@ static void update_rrd_data(const char *key, gconstpointer data, size_t len) {
>>  
>>      int skip = 0; // columns to skip at beginning. They contain non-archivable data, like uptime,
>>                    // status, is guest a template and such.
>> +    int padding = 0; // how many columns need to be added with "U" if we get an old format that is
>> +                     // missing columns at the end.
>>      int keep_columns = 0; // how many columns do we want to keep (after initial skip) in case we get
>>                            // more columns than needed from a newer format
>>  
>> @@ -1243,20 +1333,60 @@ static void update_rrd_data(const char *key, gconstpointer data, size_t len) {
>>              goto keyerror;
>>          }
>>  
>> -        skip = 2; // first two columns are live data that isn't archived
>> +        filename = g_strdup_printf(RRDDIR "/pve-node-9.0/%s", node);
>> +        char *filename_pve2 = g_strdup_printf(RRDDIR "/pve2-node/%s", node);
>>  
>> -        if (strncmp(key, "pve-node-", 9) == 0) {
>
> Btw, you can safely use `strcmp` if the thing you are comparing to is a
> string literal, since these are guaranteed to be NULL-terminated. I
> would even argue its better that using `strncmp`, since there is no
> chance of miscounting the length of the string literal (which is also
> hard to spot in code review).
>
> Applies to the other instances where you do the same as well.
>
> Of course, if *both* arguments are a pointer with content that might not
> be NULL-terminated, you are still required to use `strnlen` of avoid
> buffer overruns.
>

Please disregard this comment. While what I've said is true for checking
for a full string match, here you are actually checking for a prefix, so
strnlen is the correct choice here.

I guess we could use strlen to avoid the error potential of passing the
length explicitly. A modern compiler will optimize it anyway and compute
the length at compile time.

	strncmp(key, "pve-node-", strlen("pve-node-"))

To avoid the repetition this could then also be some constant.

But that's all not that important for now and can still be done later
on.