[pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ?

Wed Mar 10 10:08:28 CET 2021

On 10.03.21 09:18, Roland wrote:
> 
>>> corruption in particular problem situations like server crash or whatever.
>> So the prime candidate for this write load are the PVE HA Local Resource
>> Manager services on each node, they update their status and that is often
>> required to signal the current Cluster Resource Manager's master service
>> that the HA stack on that node is well alive and that commands got
>> executed with result X. So yes, this is required and intentional.
>> There may be some room for optimization, but it's not that straightforward,
>> and (over-)clever solutions are often the wrong ones for an HA stack - as
>> failure here is something we really want to avoid. But yeah, some easier
>> to pick fruits could maybe be found here.
>>
>> The other thing I just noticed when checking out:
>> # ls -l "/proc/$(pidof pmxcfs)/fd"
>>
>> to get all the DB-related FDs and then watch writes with:
>> # strace -v -s $((1<<16)) -f -p "$(pidof pmxcfs)" -e write=4,5,6
>>
>> I was additionally seeing some writes for the RSA key files which should just
>> not be there, but I need to investigate this more closely; it seemed a bit
>> too odd to me.
> not only these, I also see constant rewrites of (non-changing?) VM
> configuration data, too.
> 
> just cat config.db-wal |strings|grep ..... |sort | uniq -c   to see
> what's getting there.
> 

But that's not a real issue: the WAL is sized quite large (4 MiB, while the
DB is often only 1 or 2 MiB), so it will always contain lots of DB data.
This big WAL actually reduces additional write+syncs, as we do not need to
checkpoint it that often, so at least for reads it should be more performant.
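
To illustrate why a large WAL keeps holding old data, here is a minimal
sketch against a throwaway SQLite database (not pmxcfs' real config.db). The
table names, the marker string and the function name are made up for the
demo, and it assumes SQLite's default settings (no journal_size_limit, so the
WAL file is reused rather than truncated after a checkpoint):

```python
import os
import sqlite3
import tempfile

MARKER = b"DEMO-MARKER"  # hypothetical stand-in for some VM config content


def stale_wal_demo(updates=50):
    """Show that old page copies stay in the WAL file after a checkpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        wal_path = db_path + "-wal"
        # isolation_level=None means autocommit: one transaction per statement
        con = sqlite3.connect(db_path, isolation_level=None)
        try:
            con.execute("PRAGMA journal_mode=WAL").fetchall()
            con.execute("CREATE TABLE vm (id INTEGER PRIMARY KEY, conf TEXT)")
            con.execute("CREATE TABLE other (id INTEGER PRIMARY KEY, note TEXT)")
            con.execute("INSERT INTO vm VALUES (100, ?)",
                        (MARKER.decode() + " rev=0000",))
            # Each update changes the row, so a fresh page copy goes to the WAL.
            for i in range(1, updates + 1):
                con.execute("UPDATE vm SET conf = ? WHERE id = 100",
                            (f"{MARKER.decode()} rev={i:04d}",))
            # Copy everything back into the main DB file; the WAL file stays.
            con.execute("PRAGMA wal_checkpoint(FULL)").fetchall()
            size_before = os.path.getsize(wal_path)
            # A single small, unrelated write after the checkpoint.
            con.execute("INSERT INTO other VALUES (1, 'unrelated')")
            size_after = os.path.getsize(wal_path)
            with open(wal_path, "rb") as f:
                stale_copies = f.read().count(MARKER)
        finally:
            con.close()
    return {
        "wal_size_before": size_before,
        "wal_size_after": size_after,
        "stale_marker_copies": stale_copies,
    }


result = stale_wal_demo()
print(result)
```

Even though the last write did not touch the `vm` table at all, the WAL file
should still show many copies of the marker, which is what a
`strings | grep | uniq -c` over the whole file would count.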

Also, the WAL is read and written at offsets (e.g., via pwrite64), so only
some small, contained parts of it are actually newly written. You therefore
cannot really conclude anything from its total content, only from the actual
new writes (which can be seen with my strace command).
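
As a rough sketch of how the new writes could be tallied, the following
Python helper sums the bytes of successful pwrite64 calls per FD from strace
output. The function name and the sample lines are made up for illustration;
it assumes strace's usual one-line `pwrite64(fd, "...", count, offset) = ret`
format and simply skips anything else (hex dumps, failed or unfinished calls):

```python
import re
from collections import defaultdict

# Matches e.g.: [pid  1520] pwrite64(5, "..."..., 4096, 8272) = 4096
PWRITE_RE = re.compile(r"pwrite64\((\d+), .*, (\d+), (\d+)\)\s*=\s*(\d+)\s*$")


def summarize_pwrites(lines):
    """Return {fd: {"calls": n, "bytes": total}} for successful pwrite64 calls."""
    totals = defaultdict(lambda: {"calls": 0, "bytes": 0})
    for line in lines:
        m = PWRITE_RE.search(line)
        if m is None:
            continue  # not a completed, successful pwrite64 line
        fd = int(m.group(1))
        written = int(m.group(4))  # return value = bytes actually written
        totals[fd]["calls"] += 1
        totals[fd]["bytes"] += written
    return dict(totals)


# Invented sample lines, only to show the expected input shape.
sample = [
    r'[pid  1520] pwrite64(5, "\0\0\0\1"..., 4096, 8272) = 4096',
    r'[pid  1520] pwrite64(5, "\0\0\0\2"..., 4096, 12392) = 4096',
    r'[pid  1520] pwrite64(4, "x", 1, 0) = -1 EAGAIN (Resource temporarily unavailable)',
    r'[pid  1521] fdatasync(5) = 0',
]
for fd, stats in sorted(summarize_pwrites(sample).items()):
    print(fd, stats["calls"], stats["bytes"])
```

Feeding it the captured strace output (strace writes to stderr by default)
gives the amount of data that was really written per FD, independent of what
the WAL file happens to contain in total.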

Regarding the extra data I mentioned, it could be that this is due to SQLite
handling memory pages directly; I still need to check that out more closely.

> the weird thing is that it does not happen for every VM, just some. I'll
> send you an email with additional data (I don't want to post all my VMs'
> MAC addresses in public)
> 

For now I'm good, thanks; I can check that on my test clusters too. But if
I need anything, I'll come back to this offer.

cheers,
Thomas