[pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager

Thomas Lamprecht t.lamprecht at proxmox.com
Mon Dec 13 10:02:36 CET 2021


Hi,

On 13.12.21 08:43, Alexandre Derumier wrote:
> Hi,
> 
> this is a proof of concept to implement resource-aware HA.

nice! I'll try to give it a quick look now so that I do not stall you too much
on this long-wished-for feature.

> The current implementation is really basic,
> simply balancing the number of services on each node.
> 
> I have had some real production cases where a node fails, and the restarted VMs
> impact other nodes because of too much CPU/RAM usage.
> 
> This new implementation uses a best-fit vector-packing heuristic with constraint support.
> 
> 
> - We compute average memory/CPU stats for the nodes and the VMs over the last 20 minutes.
> 
> For each resource:
> - First, we order the services pending recovery by memory, then by CPU usage.
>   Memory is more important here, because a VM can't start if the target node doesn't have enough memory

agreed
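
A tiny sketch of that ordering, with made-up field names and example data
purely for illustration (not taken from the patches):

use strict;
use warnings;

# assumed per-service stats layout, purely illustrative
my $stats = {
    'vm:101' => { memory => 4096, cpu => 0.10 },
    'vm:102' => { memory => 8192, cpu => 0.05 },
    'vm:103' => { memory => 8192, cpu => 0.20 },
};
my @pending_sids = sort keys %$stats;

# biggest memory consumer first, then biggest cpu consumer, as is common
# for best-fit-decreasing style packing heuristics
my @ordered = sort {
    $stats->{$b}->{memory} <=> $stats->{$a}->{memory}
        || $stats->{$b}->{cpu} <=> $stats->{$a}->{cpu}
} @pending_sids;

print join(', ', @ordered), "\n"; # vm:103, vm:102, vm:101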

> 
> - Then, we check the constraints of the possible target nodes (storage availability, enough CPU/RAM, enough cores, ...).
>   (This could be extended with other constraints like VM affinity/anti-affinity, CPU compatibility, ...)
> 
> - Then we compute a weight for each node as the Euclidean distance between the VM's CPU/RAM usage vector and the node's available-resources vector.
>   We then choose the node with the lowest Euclidean distance weight.
>   (E.g. if a VM uses 1 GB RAM/1% CPU, node1 has 2 GB RAM/2% CPU and node2 has 4 GB RAM/4% CPU, node1 will be chosen because it is the closest to the VM's usage.)

sounds like an OK approach to me, I had something relatively similar in mind
(see the rough sketch a bit further below).

> 
> - We add the recovered VM's CPU/RAM to the target node's stats. (This is only a best-effort estimation, as the VM start is asynchronous on the target LRM and could fail, ...)
> 
> 
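
A minimal sketch of how those selection steps could fit together, purely for
illustration; all helper and field names here ($service_usage, $node_free,
select_recovery_node) are made up and not taken from the actual patches:

use strict;
use warnings;

# Pick a recovery target: filter nodes by constraints, then take the node
# whose free-resources vector is closest (Euclidean distance) to the
# service's usage vector, and account the usage on it.
sub select_recovery_node {
    my ($service_usage, $node_free, $candidate_nodes) = @_;

    # constraint check, reduced here to "enough free memory"; the real
    # checks would also cover storage availability, core count, ...
    my @nodes = grep {
        $node_free->{$_}->{memory} >= $service_usage->{memory}
    } @$candidate_nodes;
    return undef if !@nodes;

    my ($best_node, $best_weight);
    for my $node (@nodes) {
        # in practice the two dimensions would need to be normalized to
        # comparable scales, otherwise memory dominates the distance
        my $dmem = $node_free->{$node}->{memory} - $service_usage->{memory};
        my $dcpu = $node_free->{$node}->{cpu} - $service_usage->{cpu};
        my $weight = sqrt($dmem**2 + $dcpu**2);
        if (!defined($best_weight) || $weight < $best_weight) {
            ($best_node, $best_weight) = ($node, $weight);
        }
    }

    # best-effort bookkeeping: account the recovered service's usage on
    # the chosen node so the next placement decision already sees it
    $node_free->{$best_node}->{memory} -= $service_usage->{memory};
    $node_free->{$best_node}->{cpu} -= $service_usage->{cpu};

    return $best_node;
}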
> I have kept the HA group node priority and the other ordering,
> so this doesn't break the current tests

that is great, the HA regression tests are among the best we have to
test and simulate behavior, so keeping those unchanged can give quite a bit
of confidence in any implementation. Albeit with your change that's mostly because
they side-step the balancer, as no usage data is there?

IMO it would be good to set up most tests such that they can be affected by
the balancer, at least if we make it opt-out.

> and we can easily add an option at the datacenter level to enable/disable it

As a starter we could also only do the compute-node-by-resource-usage on recovery
and on the first start transition, as especially for the latter it's quite important to
get the service recovered to a node with a low(er) load to avoid a domino effect.

Doing re-computation then for started VMs would be easy to add once we're sure
the algorithm works out.

But yeah, some admins would surely welcome being able to configure it, like:

[ ] move to lowest used node on start and recovery of service
[ ] auto-balance started services periodically
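
Purely as an illustration, and with entirely made-up option names (nothing
like this exists yet), such switches could end up as a property string in
datacenter.cfg, e.g.:

    ha: balance-on-start=1,balance-on-recovery=1,auto-balance=0    (hypothetical, not an existing option)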

> 
> It would be easy to later implement some kind of VM auto-migration when a node uses too much CPU/RAM,
> reusing the same node selection algorithm
> 
> I have added a basic test; I'll add more tests later if this patch series is OK for you.

I'd add commands to sim_hardware_cmd for simulating CPU/memory increases,
it's nicer to have that controllable via the cmd list.

For the test system it could also be interesting if we could annotate the
services with some basic resource usage, e.g. memory and core count, and possibly
also a low (0.33), mid (0.66) or high (1.0) load factor (controllable
by command); that could help to simulate reality while keeping it somewhat simple.
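
Purely illustrative, with made-up fields and command names (nothing of this
exists in the simulator), such annotations could look roughly like:

    service_config:
      { "vm:101": { "node": "node1", "state": "started", "memory": 4096, "cores": 2, "load": 0.33 } }

    cmdlist entry to change load/memory at runtime:
      [ "service vm:101 load 1.0", "service vm:101 memory 8192" ]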

> Some good literature about heuristics:
> 
> Microsoft Hyper-V implementation:
>  - http://kunaltalwar.org/papers/VBPacking.pdf
>  - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
> Variable size vector bin packing heuristics:
>  - https://hal.archives-ouvertes.fr/hal-00868016v2/document
> 
> 
> Alexandre Derumier (3):
>   add ressource awareness manager
>   tests: add support for ressources
>   add test-basic0
> 
>  src/PVE/HA/Env.pm                    |  24 +++
>  src/PVE/HA/Env/PVE2.pm               |  90 ++++++++++
>  src/PVE/HA/Manager.pm                | 246 ++++++++++++++++++++++++++-
>  src/PVE/HA/Sim/Hardware.pm           |  61 +++++++
>  src/PVE/HA/Sim/TestEnv.pm            |  36 ++++
>  src/test/test-basic0/README          |   1 +
>  src/test/test-basic0/cmdlist         |   4 +
>  src/test/test-basic0/hardware_status |   5 +
>  src/test/test-basic0/log.expect      |  52 ++++++
>  src/test/test-basic0/manager_status  |   1 +
>  src/test/test-basic0/node_stats      |   5 +
>  src/test/test-basic0/service_config  |   5 +
>  src/test/test-basic0/service_stats   |   5 +
>  13 files changed, 528 insertions(+), 7 deletions(-)
>  create mode 100644 src/test/test-basic0/README
>  create mode 100644 src/test/test-basic0/cmdlist
>  create mode 100644 src/test/test-basic0/hardware_status
>  create mode 100644 src/test/test-basic0/log.expect
>  create mode 100644 src/test/test-basic0/manager_status
>  create mode 100644 src/test/test-basic0/node_stats
>  create mode 100644 src/test/test-basic0/service_config
>  create mode 100644 src/test/test-basic0/service_stats
> 
