[pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager

Alexandre Derumier aderumier at odiso.com
Mon Dec 13 08:43:13 CET 2021


this is a proof of concept to implement ressource aware HA.

The current implementation is really basic,
simply balancing the number of services on each node.

I had some real production cases, where a node is failing, and restarted vm
impact others nodes because of too much cpu/ram usage.

This new implementation use best-fit heuristic vector packing with constraints support.

- We compute nodes memory/cpu, and vm memory/cpu average stats  on last 20min

For each ressource :
- First, we ordering pending recovery state services by memory, then cpu usage.
  Memory is more important here, because vm can't start if target node don't have enough memory

- Then, we check possible target nodes contraints. (storage available, node have enough cpu/ram, node have enough cores,...)
  (could be extended with other constraint like vm affinity/anti-affinity, cpu compatibilty, ...)

- Then we compute a node weight with euclidean distance of both cpu/ram vectors between vm usage and node available ressources.
  Then we choose the first node with the lower eucliean distance weight.
  (Ex: if vm use 1go ram/1% cpu, node1 have 2go ram/2% cpu , and node2 have 4go ram/4% cpu,  node1 will be choose because it's the nearest of vm usage)

- We add recovered vm cpu/ram to target node stats. (This is only an best effort estimation, as the vm start is async on target lrm, and could failed,...)

I have keeped HA group node prio, and other other ordering,
so this don't break current tests, and we can add easily a option at datacenter to enable/disable

It could be easy to implement later some kind of vm auto migration when a node use too much cpu/ram,
reusing same node selection algorithm

I have added a basic test, I'll add more tests later if this patch serie is ok for you.

Some good litterature about heuristics:

microsoft hyper-v implementation: 
 - http://kunaltalwar.org/papers/VBPacking.pdf
 - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
Variable size vector bin packing heuristics:
 - https://hal.archives-ouvertes.fr/hal-00868016v2/document

Alexandre Derumier (3):
  add ressource awareness manager
  tests: add support for ressources
  add test-basic0

 src/PVE/HA/Env.pm                    |  24 +++
 src/PVE/HA/Env/PVE2.pm               |  90 ++++++++++
 src/PVE/HA/Manager.pm                | 246 ++++++++++++++++++++++++++-
 src/PVE/HA/Sim/Hardware.pm           |  61 +++++++
 src/PVE/HA/Sim/TestEnv.pm            |  36 ++++
 src/test/test-basic0/README          |   1 +
 src/test/test-basic0/cmdlist         |   4 +
 src/test/test-basic0/hardware_status |   5 +
 src/test/test-basic0/log.expect      |  52 ++++++
 src/test/test-basic0/manager_status  |   1 +
 src/test/test-basic0/node_stats      |   5 +
 src/test/test-basic0/service_config  |   5 +
 src/test/test-basic0/service_stats   |   5 +
 13 files changed, 528 insertions(+), 7 deletions(-)
 create mode 100644 src/test/test-basic0/README
 create mode 100644 src/test/test-basic0/cmdlist
 create mode 100644 src/test/test-basic0/hardware_status
 create mode 100644 src/test/test-basic0/log.expect
 create mode 100644 src/test/test-basic0/manager_status
 create mode 100644 src/test/test-basic0/node_stats
 create mode 100644 src/test/test-basic0/service_config
 create mode 100644 src/test/test-basic0/service_stats


More information about the pve-devel mailing list