[pve-devel] [PATCH docs 5/5] ha: replace in-text references to ha groups with ha rules

Hannes Duerr h.duerr at proxmox.com
Mon Aug 4 17:24:33 CEST 2025


On Mon Aug 4, 2025 at 4:11 PM CEST, Daniel Kral wrote:
> As HA groups are replaced by HA node affinity rules and users can
> implement new CRS behavior with HA resource affinity rules now, update
> texts that reference HA groups with references to HA rules instead.
>
> While at it, also replace references to "HA services" with "HA
> resources" for short sections that are touched in the process as new
> references should use the latter term only.
>
> Signed-off-by: Daniel Kral <d.kral at proxmox.com>
> ---
>  ha-manager.adoc | 49 +++++++++++++++++++++++++------------------------
>  1 file changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/ha-manager.adoc b/ha-manager.adoc
> index ffab83c..f63fd05 100644
> --- a/ha-manager.adoc
> +++ b/ha-manager.adoc
> @@ -314,9 +314,8 @@ recovery state.
>  recovery::
>  
>  Wait for recovery of the service. The HA manager tries to find a new node where
forgot to change the `service` here?
> -the service can run on. This search depends not only on the list of online and
> -quorate nodes, but also if the service is a group member and how such a group
> -is limited.
> +the service can run on. This search depends on the list of online and quorate
s/service/resource/
> +nodes as well as the affinity rules the service is part of, if any.
s/service/resource/
>  As soon as a new available node is found, the service will be moved there and
forgot to change the `service` here?
>  initially placed into stopped state. If it's configured to run the new node
>  will do so.
> @@ -977,20 +976,24 @@ Recover Fenced Services
>  ~~~~~~~~~~~~~~~~~~~~~~~
>  
>  After a node failed and its fencing was successful, the CRM tries to
> -move services from the failed node to nodes which are still online.
> +move HA resources from the failed node to nodes which are still online.
>  
> -The selection of nodes, on which those services gets recovered, is
> -influenced by the resource `group` settings, the list of currently active
> -nodes, and their respective active service count.
> +The selection of the recovery nodes is influenced by the list of
> +currently active nodes, their respective loads depending on the used
> +scheduler, and the affinity rules the resource is part of, if any.
>  
> -The CRM first builds a set out of the intersection between user selected
> -nodes (from `group` setting) and available nodes. It then choose the
> -subset of nodes with the highest priority, and finally select the node
> -with the lowest active service count. This minimizes the possibility
> +First, the CRM builds a set of nodes available to the HA resource. If the
> +resource is part of a node affinity rule, the set is reduced to the
> +highest priority nodes in the node affinity rule. If the resource is part
> +of a resource affinity rule, the set is further reduced to fulfill their
> +constraints, which is either keeping the HA resource on the same node as
> +some other HA resources or keeping the HA resource on a different node
> +than some other HA resources. Finally, the CRM selects the node with the
> +lowest load according to the used scheduler to minimize the possibility
>  of an overloaded node.
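Not a blocker, but the three steps here might be easier to follow with a
small example. This is how I read the selection, as a rough Python sketch
(all names made up, strict vs. non-strict node affinity glossed over, not
the actual CRM code):

def select_recovery_node(online_nodes, loads, node_affinity=None,
                         keep_with=None, keep_apart=None, placements=None):
    # online_nodes:  currently online and quorate nodes
    # loads:         node -> load according to the used scheduler
    # node_affinity: node -> priority from a node affinity rule, if any
    # keep_with/keep_apart: other HA resources this one must share /
    #                must not share a node with (resource affinity rules)
    # placements:    current node of those other HA resources
    placements = placements or {}
    candidates = set(online_nodes)

    # 1) node affinity: keep only the highest-priority nodes of the rule
    if node_affinity:
        allowed = candidates & node_affinity.keys()
        if allowed:
            best = max(node_affinity[n] for n in allowed)
            candidates = {n for n in allowed if node_affinity[n] == best}

    # 2) resource affinity: pin to / exclude the nodes of related resources
    for res in keep_with or ():
        if res in placements:
            candidates &= {placements[res]}
    for res in keep_apart or ():
        if res in placements:
            candidates.discard(placements[res])

    # 3) pick the remaining node with the lowest load (active resource
    #    count for the basic scheduler)
    return min(candidates, key=loads.get) if candidates else None

# e.g. vm:100 may run on node2/node3 (equal priority) but must not share
# a node with vm:101, which currently runs on node3:
select_recovery_node(
    online_nodes={'node1', 'node2', 'node3'},
    loads={'node1': 3, 'node2': 1, 'node3': 0},
    node_affinity={'node2': 1, 'node3': 1},
    keep_apart=['vm:101'],
    placements={'vm:101': 'node3'},
)  # -> 'node2'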
>  
> -CAUTION: On node failure, the CRM distributes services to the
> -remaining nodes. This increases the service count on those nodes, and
> +CAUTION: On node failure, the CRM distributes resources to the
> +remaining nodes. This increases the resource count on those nodes, and
>  can lead to high load, especially on small clusters. Please design
>  your cluster so that it can handle such worst case scenarios.
>  
> @@ -1102,7 +1105,7 @@ You can use the manual maintenance mode to mark the node as unavailable for HA
>  operation, prompting all services managed by HA to migrate to other nodes.
forgot to change the `service` here?
>  
>  The target nodes for these migrations are selected from the other currently
> -available nodes, and determined by the HA group configuration and the configured
> +available nodes, and determined by the HA rules configuration and the configured
>  cluster resource scheduler (CRS) mode.
>  During each migration, the original node will be recorded in the HA managers'
>  state, so that the service can be moved back again automatically once the
forgot to change the `service` here?
> @@ -1173,14 +1176,12 @@ This triggers a migration of all HA Services currently located on this node.
forgot to change the `service` here?
>  The LRM will try to delay the shutdown process, until all running services get
forgot to change the `service` here?
>  moved away. But, this expects that the running services *can* be migrated to
forgot to change the `service` here?
>  another node. In other words, the service must not be locally bound, for example
forgot to change the `service` here?
> -by using hardware passthrough. As non-group member nodes are considered as
> -runnable target if no group member is available, this policy can still be used
> -when making use of HA groups with only some nodes selected. But, marking a group
> -as 'restricted' tells the HA manager that the service cannot run outside of the
> -chosen set of nodes. If all of those nodes are unavailable, the shutdown will
> -hang until you manually intervene. Once the shut down node comes back online
> -again, the previously displaced services will be moved back, if they were not
> -already manually migrated in-between.
> +by using hardware passthrough. For example, strict node affinity rules tell the
s/For example, strict/Strict/
> +HA Manager that the service cannot run outside of the chosen set of nodes. If all
> +of those nodes are unavailable, the shutdown will hang until you manually
s/those/these/
> +intervene. Once the shut down node comes back online again, the previously
> +displaced services will be moved back, if they were not already manually migrated
> +in-between.
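Unrelated to the wording: now that the 'restricted' group explanation is
gone, maybe a minimal strict rule would be worth showing here, e.g.
something like this (going from memory of the rules.cfg format from
earlier in this series, so please double-check the exact syntax; rule,
node and VM names are made up):

node-affinity: keep-on-fast-nodes
    resources vm:100,vm:101
    nodes node1,node2
    strict 1

With both node1 and node2 unavailable, the shutdown would then hang as
described above until someone intervenes.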
>  
>  NOTE: The watchdog is still active during the migration process on shutdown.
>  If the node loses quorum it will be fenced and the services will be recovered.
> @@ -1266,8 +1267,8 @@ The change will be in effect starting with the next manager round (after a few
>  seconds).
>  
>  For each service that needs to be recovered or migrated, the scheduler
> -iteratively chooses the best node among the nodes with the highest priority in
> -the service's group.
> +iteratively chooses the best node among the nodes that are available to
> +the service according to their HA rules, if any.
Doesn't the scheduler take the ha node affinity priority into
consideration here?
And:
s/service/resource/
>  
>  NOTE: There are plans to add modes for (static and dynamic) load-balancing in
>  the future.
