[PVE-User] Automatic migration before reboot / shutdown? Migration to host in same group?

Thu Jul 6 11:32:50 CEST 2017

Hi all,

1) I was wondering how a PVE (4.4) cluster behaves when one of the nodes is restarted or shut down, either via the WebGUI or via
the command line. Will hosted, HA-managed VMs be migrated to other hosts before the node shuts down, or will they be stopped (and
restarted on another host once HA recognizes them as gone)?

2) Currently I run a cluster of four nodes that share the same 2U chassis:

+-----+-----+
|  A  |  B  |
+-----+-----+
|  C  |  D  |
+-----+-----+

(Please don't comment on whether this setup is ideal – I'm aware of the risks a single chassis brings…)

I created several HA groups:

- left  contains A & C
- right contains B & D
- upper contains A & B
- lower contains C & D
- all   contains all nodes

and configured VMs to run inside one of the groups.

For updates I usually follow these steps:
- migrate the VMs off the node via the "bulk migrate" feature, selecting one of the other nodes
- once no VMs are running on the node, do an "apt-get dist-upgrade" and reboot
- repeat till all nodes are up-to-date

One issue I ran into with this procedure is that sometimes, while one VM is still being migrated to another host, VMs that have
already been migrated are moved back onto the current node, because the target selected for "bulk migrate" is not part of their
HA group.

Practical example:
- VM 101 is configured to run on the left side of the cluster
- VM 102 is configured to run on the lower level of the cluster
- node C shall be updated
- I select "bulk migrate" to node D
- VM 101 is migrated to D
- VM 102 is migrated to D, but takes some time (a lot of RAM)
- HA recognizes that VM 101 is not running in the correct group and schedules a migration back to node C
- migration of VM 102 finishes and migration of VM 101 back to node C immediately starts
- once migration of VM 101 has finished I need to manually initiate another migration (and after that need to be faster than HA
to do a reboot)

Would it be possible to implement another "bulk action" that evacuates a host in such a way that, for every VM, the appropriate
target node is selected depending on the HA group configuration? This might also temporarily disable that node in HA management,
e.g. for 10 minutes or until the next reboot, so that maintenance work can be done…
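
To make the idea a bit more concrete, here is a rough sketch in Python of the selection logic I have in mind. It is not existing PVE code and does not call any Proxmox API. The function name plan_evacuation, the dictionaries and the tie-breaking rule (fewest planned VMs per node, then node name) are purely my own assumptions for illustration; node names, group names and VM IDs are the ones from the example above.

```python
# Hypothetical sketch: choose an evacuation target per VM from its HA group.
# Nothing here talks to a real cluster; all names are illustrative assumptions.

HA_GROUPS = {
    "left": {"A", "C"},
    "right": {"B", "D"},
    "upper": {"A", "B"},
    "lower": {"C", "D"},
    "all": {"A", "B", "C", "D"},
}


def plan_evacuation(node, vm_groups, vm_location, groups=HA_GROUPS):
    """Return {vmid: target_node} for every VM currently located on `node`.

    The target is a member of the VM's own HA group other than `node`.
    Ties are broken by the planned VM count per node, then by node name.
    A VM whose group offers no other node is mapped to None.
    """
    # Count how many VMs each node holds right now.
    load = {}
    for where in vm_location.values():
        load[where] = load.get(where, 0) + 1

    plan = {}
    for vmid in sorted(vm_location):
        if vm_location[vmid] != node:
            continue  # VM is not on the node being evacuated
        candidates = groups[vm_groups[vmid]] - {node}
        if not candidates:
            plan[vmid] = None  # nowhere to go inside its group
            continue
        target = min(candidates, key=lambda n: (load.get(n, 0), n))
        plan[vmid] = target
        load[target] = load.get(target, 0) + 1  # account for the planned move
    return plan


if __name__ == "__main__":
    vm_groups = {101: "left", 102: "lower"}
    vm_location = {101: "C", 102: "C"}
    print(plan_evacuation("C", vm_groups, vm_location))
```

For the example above (node C to be updated), this would send VM 101 to A and VM 102 to D, so neither VM ends up outside its group and HA has no reason to move anything back.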

What do you think of that idea?

Regards,

	Uwe