[pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager
Alexandre Derumier
aderumier at odiso.com
Mon Dec 13 08:43:13 CET 2021
Hi,
this is a proof of concept to implement resource-aware HA.
The current implementation is really basic,
simply balancing the number of services on each node.
I have had real production cases where a node failed and the restarted VMs
impacted other nodes because of excessive CPU/RAM usage.
This new implementation uses a best-fit vector packing heuristic, with constraint support.
- We compute node memory/CPU and VM memory/CPU average stats over the last 20 minutes.
For each resource:
- First, we order the services in pending recovery state by memory, then by CPU usage.
Memory is more important here, because a VM can't start if the target node doesn't have enough memory.
- Then, we check the constraints of possible target nodes (storage available, node has enough CPU/RAM, node has enough cores, ...).
(This could be extended with other constraints such as VM affinity/anti-affinity, CPU compatibility, ...)
- Then we compute a node weight as the Euclidean distance between the VM usage vector and the node's available resources vector (CPU/RAM).
We then choose the first node with the lowest Euclidean distance weight.
(Ex: if a VM uses 1 GB RAM/1% CPU, node1 has 2 GB RAM/2% CPU available, and node2 has 4 GB RAM/4% CPU available, node1 will be chosen because it is the nearest to the VM usage.)
- We add the recovered VM's CPU/RAM to the target node stats. (This is only a best-effort estimation, as the VM start is asynchronous on the target LRM and could fail, ...)
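To make the steps above concrete, here is a minimal sketch in Python. It is not the Perl code from the patches. The names (Node, Service, fits, weight, place) are hypothetical. It assumes memory in GB and CPU in percent, left unnormalized as in the example above, sorts the biggest consumers first (the sort direction is an assumption), and reduces the constraint check to CPU/RAM only.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mem_gb: float   # available memory on the node
    free_cpu_pct: float  # available CPU on the node, in percent


@dataclass
class Service:
    sid: str
    mem_gb: float   # average memory usage of the VM
    cpu_pct: float  # average CPU usage of the VM, in percent


def fits(node, svc):
    # Constraint check reduced to CPU/RAM; storage, core count,
    # affinity rules etc. are left out of this sketch.
    return node.free_mem_gb >= svc.mem_gb and node.free_cpu_pct >= svc.cpu_pct


def weight(node, svc):
    # Euclidean distance between what the node has left and what the VM uses.
    return math.hypot(node.free_mem_gb - svc.mem_gb,
                      node.free_cpu_pct - svc.cpu_pct)


def place(services, nodes):
    placement = {}
    # Order by memory first, then CPU (descending order is an assumption).
    ordered = sorted(services, key=lambda s: (s.mem_gb, s.cpu_pct), reverse=True)
    for svc in ordered:
        candidates = [n for n in nodes if fits(n, svc)]
        if not candidates:
            placement[svc.sid] = None  # no node satisfies the constraints
            continue
        # min() returns the first node with the lowest distance.
        best = min(candidates, key=lambda n: weight(n, svc))
        # Best-effort accounting: charge the VM to the chosen node right away,
        # which is equivalent to adding its usage to the node stats.
        best.free_mem_gb -= svc.mem_gb
        best.free_cpu_pct -= svc.cpu_pct
        placement[svc.sid] = best.name
    return placement
```

With the example above (a VM using 1 GB/1%, node1 with 2 GB/2% available, node2 with 4 GB/4% available), place() selects node1. With raw units the memory and CPU axes are not on the same scale, so a real implementation would probably need some normalization.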
I have kept the HA group node priority and the other ordering criteria,
so this doesn't break the current tests, and we can easily add a datacenter option to enable/disable it.
It would be easy to implement some kind of VM auto-migration later, for when a node uses too much CPU/RAM,
reusing the same node selection algorithm.
I have added a basic test; I'll add more tests later if this patch series is OK for you.
Some good literature about heuristics:
Microsoft Hyper-V implementation:
- http://kunaltalwar.org/papers/VBPacking.pdf
- https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
Variable size vector bin packing heuristics:
- https://hal.archives-ouvertes.fr/hal-00868016v2/document
Alexandre Derumier (3):
add ressource awareness manager
tests: add support for ressources
add test-basic0
src/PVE/HA/Env.pm | 24 +++
src/PVE/HA/Env/PVE2.pm | 90 ++++++++++
src/PVE/HA/Manager.pm | 246 ++++++++++++++++++++++++++-
src/PVE/HA/Sim/Hardware.pm | 61 +++++++
src/PVE/HA/Sim/TestEnv.pm | 36 ++++
src/test/test-basic0/README | 1 +
src/test/test-basic0/cmdlist | 4 +
src/test/test-basic0/hardware_status | 5 +
src/test/test-basic0/log.expect | 52 ++++++
src/test/test-basic0/manager_status | 1 +
src/test/test-basic0/node_stats | 5 +
src/test/test-basic0/service_config | 5 +
src/test/test-basic0/service_stats | 5 +
13 files changed, 528 insertions(+), 7 deletions(-)
create mode 100644 src/test/test-basic0/README
create mode 100644 src/test/test-basic0/cmdlist
create mode 100644 src/test/test-basic0/hardware_status
create mode 100644 src/test/test-basic0/log.expect
create mode 100644 src/test/test-basic0/manager_status
create mode 100644 src/test/test-basic0/node_stats
create mode 100644 src/test/test-basic0/service_config
create mode 100644 src/test/test-basic0/service_stats
--
2.30.2