[pve-devel] stream or rrd for ksm sharing counter ?

Alexandre DERUMIER aderumier at odiso.com
Wed Oct 9 13:00:55 CEST 2019


> As we could have 2 servers with 80% memory usage, but real ksm can be different. 
> (for better loadbalancing, we should calc the memory + ksm memory usage). 

>>but would you need to know how much could be shared on the target 
>>node to actually get this calculation right?? 
I don't think we can correctly calculate how much will be shared for the VM after migration.
As the sharing happens asynchronously anyway, it could take some minutes.

But for the source VM, we have its current memory usage (shared or not shared), so we need to know
whether we have enough resources on the target host to handle that.

My point about KSM is just that in the case where almost all nodes are at 80% memory usage, but some of
them have a lot of KSM-shared memory, we want to prioritize the other ones.

Generally, in production, I try to never have KSM shared memory larger than 20% of total memory
(so my real total memory usage, with or without KSM, is never more than 100%).
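
A minimal sketch of the "memory + KSM" calculation, assuming the saved memory is approximated as the kernel's `pages_sharing` counter (from `/sys/kernel/mm/ksm/pages_sharing`) multiplied by the page size. The function names and the 4 KiB default page size are illustrative assumptions, not existing Proxmox code:

```python
# Hypothetical helpers; only the sysfs path is a real kernel interface.
KSM_PAGES_SHARING = "/sys/kernel/mm/ksm/pages_sharing"
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def read_ksm_pages_sharing(path=KSM_PAGES_SHARING):
    """Read the KSM pages_sharing counter of the local node."""
    with open(path) as f:
        return int(f.read().strip())


def ksm_saved_bytes(pages_sharing, page_size=PAGE_SIZE):
    """Approximate the memory saved by KSM, in bytes."""
    return pages_sharing * page_size


def effective_mem_ratio(used_bytes, total_bytes, pages_sharing,
                        page_size=PAGE_SIZE):
    """Memory usage ratio as if nothing were shared by KSM."""
    saved = ksm_saved_bytes(pages_sharing, page_size)
    return (used_bytes + saved) / total_bytes


GIB = 2 ** 30
# Two nodes, both at 80% usage of 100 GiB, but node B saves 20 GiB via KSM.
node_a = effective_mem_ratio(80 * GIB, 100 * GIB, pages_sharing=0)
node_b = effective_mem_ratio(80 * GIB, 100 * GIB,
                             pages_sharing=20 * GIB // PAGE_SIZE)
```

With these numbers both nodes report the same plain usage, but node B has a higher effective ratio than node A, so node A would be the preferred target.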

Also, I haven't thought about swap usage (as I don't use swap in production :p)


>>Different load algos/params could then be tried out and probably users 
>>have different needs. E.g., personally I'd mostly want to do memory as 
>>the rest is just too dynamic in my setups. 

I haven't thought yet about dynamic memory with ballooning in the VM :/


>>As long as no task is stalled just because of lack of CPU resources 
>>there's no need to move things around (when just looking at CPU 
>>balancing), same for memory. 

Yes, totally agreed. We don't want to load-balance to achieve perfect balancing (almost impossible to predict).
But if a host uses more CPU/RAM than a threshold (maybe load from PSI), we could evict some VMs to get back down to this threshold on the source host,
while also staying under this threshold on the target host.


>>We could even poll the PSI interface and get notified if it passes 
>>certain thresholds[0]. 
>>
>>[0]: https://www.kernel.org/doc/html/latest/accounting/psi.html 
>>

Yes, PSI seems great. I haven't tested it yet.
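
As a rough, untested-in-production sketch of what polling PSI could look like: the kernel exposes `/proc/pressure/cpu`, `/proc/pressure/memory` and `/proc/pressure/io`, each with lines of the form `some avg10=... avg60=... avg300=... total=...`. The helper names and the threshold check below are hypothetical:

```python
def parse_psi(text):
    """Parse the content of a /proc/pressure/<resource> file.

    Returns e.g. {"some": {"avg10": 0.0, ..., "total": 0.0}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in
                        (field.split("=", 1) for field in fields)}
    return result


def read_psi(resource):
    """Read PSI for 'cpu', 'memory' or 'io' on the local node."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())


def over_threshold(psi, threshold, kind="some", window="avg60"):
    """True if the chosen PSI average (a percentage) exceeds the threshold."""
    return psi[kind][window] > threshold


sample = (
    "some avg10=1.50 avg60=12.25 avg300=3.00 total=123456\n"
    "full avg10=0.00 avg60=0.50 avg300=0.10 total=7890\n"
)
psi = parse_psi(sample)
```

The sample text is made up for illustration; on a real node `read_psi("memory")` would supply the values.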


>>I'd actually separate this all into two things: 

>>* dynamic balancing: done only if really needed, doesn't care about 
>>balancing out as long as all VM/CTs have enough resources to run. 
>>IOW, there could be grave utilization differences but still all 
>>VM/CTs get scheduled, so we do not move.
Yes, totally agreed.

>>This should rather try to ensure all can keep running over a longer time. PSI would 
>>be great here as it can be used to see if a task (group) is 
>>actually not able to run, if another node has better (lower) PSI 
>>then we know that it has less utilization (and we know for real) 
>>and can move something there, if the difference is big enough. 
Yes, it should work.

The current algorithm orders the target hosts using a dot product: it compares the (cpu, ram, ...)
of the source VM against the (free cpu, free ram, ...) of the hosts to find the best match across multiple parameters
(something like a vector projection of the VM onto the host).
After that, we can simply skip to the next host if PSI is too high
(or if the host doesn't have the shared storage, or violates another constraint like affinity/anti-affinity).
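
A simplified sketch of this ordering, not the actual patch: hosts are scored by the scalar projection of their free-resource vector onto the VM's demand vector, and hosts with too high PSI or without the shared storage are skipped. The dict keys, the normalization to fractions, and the `max_psi` parameter are assumptions for illustration:

```python
import math


def projection_score(vm_demand, host_free):
    """Scalar projection of the host's free resources onto the VM demand."""
    norm = math.sqrt(sum(d * d for d in vm_demand))
    if norm == 0:
        return 0.0
    return sum(d * f for d, f in zip(vm_demand, host_free)) / norm


def rank_targets(vm_demand, hosts, max_psi):
    """Return candidate hosts, best match first, after applying constraints."""
    candidates = [
        h for h in hosts
        if h["psi"] <= max_psi and h["has_storage"]
    ]
    return sorted(candidates,
                  key=lambda h: projection_score(vm_demand, h["free"]),
                  reverse=True)


# Vectors are (cpu, ram) as fractions of a host; values are made up.
vm = (0.1, 0.4)
hosts = [
    {"name": "a", "free": (0.5, 0.2), "psi": 1.0, "has_storage": True},
    {"name": "b", "free": (0.3, 0.7), "psi": 2.0, "has_storage": True},
    {"name": "c", "free": (0.9, 0.9), "psi": 40.0, "has_storage": True},
]
order = [h["name"] for h in rank_targets(vm, hosts, max_psi=10.0)]
```

Here host "c" has the most free resources but is skipped because of its PSI, and "b" ranks ahead of "a" because its free RAM matches the memory-heavy VM better.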

Also, we should try to reduce the number of migrations if possible.
(If host memory is high, try to migrate the VM with the highest memory and lowest CPU first, for example.)
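
One possible way to express that heuristic, as a greedy sketch only (it does not guarantee the minimal number of migrations): sort the VMs by highest memory, then lowest CPU, and pick VMs until the host is back under the memory threshold. The field names and units are hypothetical:

```python
def pick_vms_to_evict(vms, host_mem_used, host_mem_total, threshold):
    """Greedily choose VMs to migrate until memory usage is under threshold.

    vms: list of dicts with 'id', 'mem' (same unit as host_mem_*) and 'cpu'.
    threshold: allowed memory usage as a fraction of host_mem_total.
    """
    excess = host_mem_used - threshold * host_mem_total
    chosen = []
    # Highest memory first; among equal memory, lowest CPU first.
    for vm in sorted(vms, key=lambda v: (-v["mem"], v["cpu"])):
        if excess <= 0:
            break
        chosen.append(vm["id"])
        excess -= vm["mem"]
    return chosen


vms = [
    {"id": 100, "mem": 8, "cpu": 0.5},
    {"id": 101, "mem": 16, "cpu": 0.9},
    {"id": 102, "mem": 16, "cpu": 0.1},
    {"id": 103, "mem": 4, "cpu": 0.2},
]
# Host uses 90 of 100 GiB and the threshold is 80%: 10 GiB must be freed.
to_move = pick_vms_to_evict(vms, host_mem_used=90, host_mem_total=100,
                            threshold=0.8)
```

In this made-up case a single migration (VM 102, the large and mostly idle one) is enough to get back under the threshold.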


>>* user-triggered balance out: this would only be triggered manually 
>>by an admin. UI should be made so that the suggested movements are 
>>visible. It's somewhat like your proposed patc 

Yes, agreed too. (My patch was just a demo of the algo.) But yes,
something like "rebalance <node> <threshold>", for example.



I'll look at the PSI counters in production, to see how they behave with memory usage + KSM + swap ...



----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>, "aderumier" <aderumier at odiso.com>
Sent: Wednesday, October 9, 2019 09:41:58
Subject: Re: [pve-devel] stream or rrd for ksm sharing counter ?

On 10/9/19 8:41 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I'm still trying to improve loadbalancing. 
> 
> Currently we don't stream ksm sharing counter, 
> I think it could be great to stream it or push it to rrd (with extra rrd ? change the current memory format ?) 
> 
> What is the best way to do it ? 
> 
> 
> As we could have 2 servers with 80% memory usage, but real ksm can be different. 
> (for better loadbalancing, we should calc the memory + ksm memory usage). 

but would you need to know how much could be shared on the target 
node to actually get this calculation right?? 

We'd rather keep it simple for now and start off with just the whole code 
infrastructure and static scheduling (i.e., VMs/CTs can be assigned 
resource-use-points). After we have that, we have a base we can compare 
against, and can be sure that we do not get "move everything one node away, 
in a circle" situations. 
Different load algos/params could then be tried out and probably users 
have different needs. E.g., personally I'd mostly want to do memory as 
the rest is just too dynamic in my setups. 

For CPU, IMO the correct metric still needs to be found. But IMO one 
that's somewhat OK could be pressure stall information[0]. 

As long as no task is stalled just because of lack of CPU resources 
there's no need to move things around (when just looking at CPU 
balancing), same for memory. 

We could even poll the PSI interface and get notified if it passes 
certain thresholds[0]. 

[0]: https://www.kernel.org/doc/html/latest/accounting/psi.html 

I'd actually separate this all into two things: 

* dynamic balancing: done only if really needed, doesn't care about 
balancing out as long as all VM/CTs have enough resources to run. 
IOW, there could be grave utilization differences but still all 
VM/CTs get scheduled, so we do not move. This should rather 
try to ensure all can keep running over a longer time. PSI would 
be great here as it can be used to see if a task (group) is 
actually not able to run, if another node has better (lower) PSI 
then we know that it has less utilization (and we know for real) 
and can move something there, if the difference is big enough. 

* user-triggered balance out: this would only be triggered manually 
by an admin. UI should be made so that the suggested movements are 
visible. It's somewhat like your proposed patc 

just to throw out my ideas :) 



