[pve-devel] [PATCH-SERIES v2 ha-manager/docs] add static usage scheduler for HA manager

Fiona Ebner f.ebner at proxmox.com
Thu Nov 17 15:00:01 CET 2022


Right now, the online node usage calculation for the HA manager only
considers the number of active services on each node. This patch
series allows switching to a 'static' scheduler mode, where static
usage information from the nodes and guest configurations is used
instead.
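
To illustrate the difference between the two modes, here is a minimal
sketch in Python (not the actual Perl code from this series). The node
names, the 'maxmem' field and the memory-only metric are assumptions
made purely for illustration:

```python
# Hypothetical data: services currently running on each node, with the
# memory (in GiB) configured for each guest.
services_on_node = {
    "node1": [{"sid": "vm:100", "maxmem": 64}],
    "node2": [{"sid": "vm:101", "maxmem": 4}, {"sid": "vm:102", "maxmem": 4}],
}

def basic_usage(node):
    """Current behaviour: usage is the number of active services."""
    return len(services_on_node[node])

def static_usage(node):
    """Static mode (simplified assumption): sum of configured guest memory."""
    return sum(s["maxmem"] for s in services_on_node[node])

def least_used(usage_fn):
    # Compare by usage first and node name second, so ties are deterministic.
    return min(services_on_node, key=lambda n: (usage_fn(n), n))

print(least_used(basic_usage))   # node1 (one service vs. two)
print(least_used(static_usage))  # node2 (8 GiB vs. 64 GiB configured)
```

Counting services prefers node1 although it hosts the single large
guest, while the static view prefers node2.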

With this version, the effect is limited to choosing nodes during
recovery or for migrations triggered by a shutdown policy, but the
plan is to extend this in the future.

As a next step, it would be nice to also have this for startup, but
AFAICT the issue is that the node selection only happens after the
state is already set to started, and I think select_service_node()
doesn't currently know whether a service has been newly started. I
haven't looked into it in too much detail though.

One idea for getting a balancer out of it is to:
1. (optionally) sort all services by badness (needs new backend function)
2. iterate scoring the nodes for each service, adding the usage to the
   chosen node after each iteration. The current node can be kept if the
   score compared to the best node doesn't differ too much.
3. record the chosen nodes and migrate the services accordingly.
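
The three steps above could be sketched roughly as follows. This is a
Python illustration only, not code from this series: the Node class,
the memory-ratio score, using "largest service first" as the badness
order and the 'tolerance' parameter are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    maxmem: float        # total memory of the node
    used: float = 0.0    # memory accounted to the node so far

def score(node, service_mem):
    """Lower is better: relative memory usage if the service ran here."""
    return (node.used + service_mem) / node.maxmem

def plan_balance(nodes, services, tolerance=0.1):
    """services is a list of (sid, current_node_name, mem) tuples.
    Returns {sid: target_node_name} for the services that should migrate."""
    by_name = {n.name: n for n in nodes}
    for n in nodes:
        n.used = 0.0
    # Step 1: sort by "badness"; assumed here to mean largest service first.
    ordered = sorted(services, key=lambda s: s[2], reverse=True)
    migrations = {}
    for sid, current, mem in ordered:
        # Step 2: score all nodes for this service and pick the best one.
        best = min(nodes, key=lambda n: (score(n, mem), n.name))
        cur = by_name[current]
        # Keep the current node if it is not much worse than the best one.
        if score(cur, mem) - score(best, mem) <= tolerance:
            best = cur
        best.used += mem  # add the usage to the chosen node
        # Step 3: record the chosen node if it implies a migration.
        if best.name != current:
            migrations[sid] = best.name
    return migrations

nodes = [Node("A", 100), Node("B", 100)]
services = [("vm:100", "A", 40), ("vm:101", "A", 40), ("vm:102", "A", 10)]
print(plan_balance(nodes, services))  # {'vm:101': 'B'}
```

The tolerance avoids migrations that would only improve the score
marginally.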


The online node usage calculation is factored out into a 'Usage'
plugin system to make adding the new static mode easier and to avoid
clutter. If not all nodes provide static service information, we
fall back to the 'basic' mode. If only the scoring fails, the service
count is used as a fallback.
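
The two fallbacks described above can be sketched as follows. This is
a Python illustration, not the Perl plugins from this series: the
class and method names, the make_usage() helper and the memory-ratio
score are hypothetical.

```python
class BasicUsage:
    """'basic' mode: usage is the number of active services per node."""
    def __init__(self, nodes):
        self.count = {n: 0 for n in nodes}

    def add_service(self, sid, node):
        self.count[node] += 1

    def score_nodes(self, sid):
        return dict(self.count)  # lower is better


class StaticUsage(BasicUsage):
    """'static' mode: usage derived from configured resources (assumed:
    memory only). Still tracks the service count for the fallback."""
    def __init__(self, node_stats, service_stats):
        missing = [n for n, s in node_stats.items() if s is None]
        if missing:
            raise ValueError(f"no static stats for nodes: {missing}")
        super().__init__(node_stats)
        self.node_stats = node_stats
        self.service_stats = service_stats
        self.used = {n: 0.0 for n in node_stats}

    def add_service(self, sid, node):
        super().add_service(sid, node)
        self.used[node] += self.service_stats.get(sid, {}).get("maxmem", 0)

    def score_nodes(self, sid):
        try:
            mem = self.service_stats[sid]["maxmem"]
            return {n: (self.used[n] + mem) / self.node_stats[n]["maxmem"]
                    for n in self.used}
        except (KeyError, ZeroDivisionError):
            # Only the scoring failed: fall back to the service count.
            return super().score_nodes(sid)


def make_usage(mode, node_stats, service_stats):
    """Use 'static' if possible, otherwise fall back to 'basic'."""
    if mode == "static":
        try:
            return StaticUsage(node_stats, service_stats)
        except ValueError:
            pass  # not all nodes provide static information
    return BasicUsage(node_stats)
```

With this split, a missing node statistic changes the mode for the
whole manager, while a failed score only affects the single node
selection it was requested for.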


Dependency bumps needed:
pve-ha-manager (build-)depends on proxmox-perl-rs
The new feature is of course only usable with updated pve-manager and
pve-cluster, but there is no hard dependency.


Changes from v1:
    * Drop already applied patches.
    * Add tests for HA manager which also required properly adding
      relevant methods to the simulation environment.
    * Implement fallback for scoring in Usage/Static.pm.
    * Improve documentation and mention current limitation with many
      services.


ha-manager:

Fiona Ebner (15):
  env: add get_static_node_stats() method
  resources: add get_static_stats() method
  add Usage base plugin and Usage::Basic plugin
  manager: select service node: add $sid to parameters
  manager: online node usage: switch to Usage::Basic plugin
  usage: add Usage::Static plugin
  env: rename get_ha_settings to get_datacenter_settings
  env: datacenter config: include crs (cluster-resource-scheduling)
    setting
  manager: set resource scheduler mode upon init
  manager: use static resource scheduler when configured
  manager: avoid scoring nodes if maintenance fallback node is valid
  manager: avoid scoring nodes when not trying next and current node is
    valid
  usage: static: use service count on nodes as a fallback
  test: add tests for static resource scheduling
  resources: add missing PVE::Cluster use statements

 debian/pve-ha-manager.install                 |   3 +
 src/PVE/HA/Env.pm                             |  10 +-
 src/PVE/HA/Env/PVE2.pm                        |  27 ++-
 src/PVE/HA/LRM.pm                             |   4 +-
 src/PVE/HA/Makefile                           |   3 +-
 src/PVE/HA/Manager.pm                         |  79 +++++---
 src/PVE/HA/Resources.pm                       |   5 +
 src/PVE/HA/Resources/PVECT.pm                 |  13 ++
 src/PVE/HA/Resources/PVEVM.pm                 |  16 ++
 src/PVE/HA/Sim/Env.pm                         |  13 +-
 src/PVE/HA/Sim/Hardware.pm                    |  28 +++
 src/PVE/HA/Sim/Resources.pm                   |  10 +
 src/PVE/HA/Usage.pm                           |  50 +++++
 src/PVE/HA/Usage/Basic.pm                     |  52 ++++++
 src/PVE/HA/Usage/Makefile                     |   6 +
 src/PVE/HA/Usage/Static.pm                    | 120 ++++++++++++
 src/test/test-crs-static1/README              |   4 +
 src/test/test-crs-static1/cmdlist             |   4 +
 src/test/test-crs-static1/datacenter.cfg      |   6 +
 src/test/test-crs-static1/hardware_status     |   5 +
 src/test/test-crs-static1/log.expect          |  50 +++++
 src/test/test-crs-static1/manager_status      |   1 +
 src/test/test-crs-static1/service_config      |   3 +
 .../test-crs-static1/static_service_stats     |   3 +
 src/test/test-crs-static2/README              |   4 +
 src/test/test-crs-static2/cmdlist             |  20 ++
 src/test/test-crs-static2/datacenter.cfg      |   6 +
 src/test/test-crs-static2/groups              |   2 +
 src/test/test-crs-static2/hardware_status     |   7 +
 src/test/test-crs-static2/log.expect          | 171 ++++++++++++++++++
 src/test/test-crs-static2/manager_status      |   1 +
 src/test/test-crs-static2/service_config      |   3 +
 .../test-crs-static2/static_service_stats     |   3 +
 src/test/test-crs-static3/README              |   5 +
 src/test/test-crs-static3/cmdlist             |   4 +
 src/test/test-crs-static3/datacenter.cfg      |   9 +
 src/test/test-crs-static3/hardware_status     |   5 +
 src/test/test-crs-static3/log.expect          | 131 ++++++++++++++
 src/test/test-crs-static3/manager_status      |   1 +
 src/test/test-crs-static3/service_config      |  12 ++
 .../test-crs-static3/static_service_stats     |  12 ++
 src/test/test-crs-static4/README              |   6 +
 src/test/test-crs-static4/cmdlist             |   4 +
 src/test/test-crs-static4/datacenter.cfg      |   9 +
 src/test/test-crs-static4/hardware_status     |   5 +
 src/test/test-crs-static4/log.expect          | 149 +++++++++++++++
 src/test/test-crs-static4/manager_status      |   1 +
 src/test/test-crs-static4/service_config      |  12 ++
 .../test-crs-static4/static_service_stats     |  12 ++
 src/test/test-crs-static5/README              |   5 +
 src/test/test-crs-static5/cmdlist             |   4 +
 src/test/test-crs-static5/datacenter.cfg      |   9 +
 src/test/test-crs-static5/hardware_status     |   5 +
 src/test/test-crs-static5/log.expect          | 117 ++++++++++++
 src/test/test-crs-static5/manager_status      |   1 +
 src/test/test-crs-static5/service_config      |  10 +
 .../test-crs-static5/static_service_stats     |  11 ++
 src/test/test_failover1.pl                    |  21 ++-
 58 files changed, 1242 insertions(+), 50 deletions(-)
 create mode 100644 src/PVE/HA/Usage.pm
 create mode 100644 src/PVE/HA/Usage/Basic.pm
 create mode 100644 src/PVE/HA/Usage/Makefile
 create mode 100644 src/PVE/HA/Usage/Static.pm
 create mode 100644 src/test/test-crs-static1/README
 create mode 100644 src/test/test-crs-static1/cmdlist
 create mode 100644 src/test/test-crs-static1/datacenter.cfg
 create mode 100644 src/test/test-crs-static1/hardware_status
 create mode 100644 src/test/test-crs-static1/log.expect
 create mode 100644 src/test/test-crs-static1/manager_status
 create mode 100644 src/test/test-crs-static1/service_config
 create mode 100644 src/test/test-crs-static1/static_service_stats
 create mode 100644 src/test/test-crs-static2/README
 create mode 100644 src/test/test-crs-static2/cmdlist
 create mode 100644 src/test/test-crs-static2/datacenter.cfg
 create mode 100644 src/test/test-crs-static2/groups
 create mode 100644 src/test/test-crs-static2/hardware_status
 create mode 100644 src/test/test-crs-static2/log.expect
 create mode 100644 src/test/test-crs-static2/manager_status
 create mode 100644 src/test/test-crs-static2/service_config
 create mode 100644 src/test/test-crs-static2/static_service_stats
 create mode 100644 src/test/test-crs-static3/README
 create mode 100644 src/test/test-crs-static3/cmdlist
 create mode 100644 src/test/test-crs-static3/datacenter.cfg
 create mode 100644 src/test/test-crs-static3/hardware_status
 create mode 100644 src/test/test-crs-static3/log.expect
 create mode 100644 src/test/test-crs-static3/manager_status
 create mode 100644 src/test/test-crs-static3/service_config
 create mode 100644 src/test/test-crs-static3/static_service_stats
 create mode 100644 src/test/test-crs-static4/README
 create mode 100644 src/test/test-crs-static4/cmdlist
 create mode 100644 src/test/test-crs-static4/datacenter.cfg
 create mode 100644 src/test/test-crs-static4/hardware_status
 create mode 100644 src/test/test-crs-static4/log.expect
 create mode 100644 src/test/test-crs-static4/manager_status
 create mode 100644 src/test/test-crs-static4/service_config
 create mode 100644 src/test/test-crs-static4/static_service_stats
 create mode 100644 src/test/test-crs-static5/README
 create mode 100644 src/test/test-crs-static5/cmdlist
 create mode 100644 src/test/test-crs-static5/datacenter.cfg
 create mode 100644 src/test/test-crs-static5/hardware_status
 create mode 100644 src/test/test-crs-static5/log.expect
 create mode 100644 src/test/test-crs-static5/manager_status
 create mode 100644 src/test/test-crs-static5/service_config
 create mode 100644 src/test/test-crs-static5/static_service_stats


docs:

Fiona Ebner (2):
  ha: add section about scheduler modes
  ha: add warning against using 'static' mode with many services

 ha-manager.adoc | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

-- 
2.30.2





