[pve-devel] applied: [PATCH ha-manager 1/3] add 'migrate' node shutdown policy

Thomas Lamprecht t.lamprecht at proxmox.com
Mon Nov 25 19:49:11 CET 2019


This adds handling for a new shutdown policy, namely "migrate".
If that is set, the LRM doesn't queue stop jobs but transitions to a
new mode, namely 'maintenance'.

The LRM modes are now passed by the CRM to the NodeStatus update
method; this allows it to detect such a mode and make node-status
state transitions. Effectively, we only allow the transition if the
node is currently online, else it is ignored. Note that 'maintenance'
does not protect a node from fencing.
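
Condensed, the resulting node-state transitions (as implemented in the
NodeStatus.pm hunks below) boil down to the following sketch, where
$node_is_quorate stands in for the $node_info->{$node}->{online} check
done in update():

    if ($node_is_quorate) {
        if ($state eq 'online' && $lrm_mode eq 'maintenance') {
            $set_node_state->($self, $node, 'maintenance');
        } elsif ($state eq 'maintenance' && $lrm_mode ne 'maintenance') {
            $set_node_state->($self, $node, 'online');
        }
    } else {
        # 'maintenance' degrades to 'unknown' just like 'online' does,
        # so a node in maintenance still gets fenced if it drops out
        if ($state eq 'online' || $state eq 'maintenance') {
            $set_node_state->($self, $node, 'unknown');
        }
    }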

The actual moving is then done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and thus also not in
the online_node_usage hash used to re-calculate whether a service
needs to be moved. Only started services get moved; this works almost
entirely by leveraging existing behavior, the next_state_started FSM
state transition method just needs to be taught not to return early
for nodes which are not online but in maintenance mode.

A few tests, adapted from the other policy tests, are added to
showcase the behavior on reboot, shutdown, and shutdown of the current
manager node. They also show the behavior when a service cannot be
migrated; as our test system can only simulate at most 9 consecutive
migration failures, the migration then "seems" to succeed. Note that
in a real setup the maximal retry count would have been hit much
earlier, so this is just an artifact of the test system.

Besides some implementation details, two questions are still not
solved by this approach:
* what if a service cannot be moved away, either due to errors or
  because select_service_node finds no alternative node?
  - retry indefinitely, which is what currently happens. The user set
    this up like this in the first place, and we will order SSH and
    pveproxy after the LRM service to ensure that there's still the
    possibility for manual intervention
  - an idea would be to track the time and see if we're stuck (this is
    not too hard); in such a case we could stop the services after X
    minutes and continue (see the sketch after this list)
* a full cluster shutdown. Even without this mode that is not ideal,
  as nodes already get fenced once no partition is quorate anymore.
  And as long as it's just a central setting in the DC config, an
  admin has a single switch to flip to make it work, so it's unclear
  how much handling we want to do here; once we're past the point
  where we have no quorum we're dead anyhow. At least this is not
  really an issue of this series - orthogonally related, yes, but not
  more.
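
As a rough illustration of the second idea above: the LRM now records
the time of the shutdown request (see the $self->{shutdown_request}
change below), so a cut-off could look roughly like the following
sketch. The timeout value and the log-only reaction are purely
hypothetical and not part of this patch:

    # hypothetical sketch only - not implemented by this patch
    my $max_maintenance_time = 10 * 60; # illustrative cut-off, in seconds
    if ($haenv->get_time() - $self->{shutdown_request} > $max_maintenance_time) {
        # a real implementation would queue stop jobs for the remaining
        # services here, much like the other shutdown policies do
        $haenv->log('warn', "maintenance time exceeded, stopping remaining services");
    }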

For real-world usability the datacenter.cfg schema needs to be changed
to allow the 'migrate' shutdown policy, but that's trivial.
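
The test harness already exercises the new value through its simulated
datacenter.cfg (JSON as used by the test system, see the test files
below; the format of the production datacenter.cfg differs):

    {
        "ha": {
            "shutdown_policy": "migrate"
        }
    }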

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
 src/PVE/HA/LRM.pm                             |  61 +++++++++-
 src/PVE/HA/Manager.pm                         |  19 ++--
 src/PVE/HA/NodeStatus.pm                      |  19 +++-
 .../cmdlist                                   |   4 +
 .../datacenter.cfg                            |   5 +
 .../hardware_status                           |   5 +
 .../log.expect                                | 106 ++++++++++++++++++
 .../manager_status                            |   1 +
 .../service_config                            |   4 +
 src/test/test-shutdown-policy3/cmdlist        |   4 +
 src/test/test-shutdown-policy3/datacenter.cfg |   5 +
 .../test-shutdown-policy3/hardware_status     |   5 +
 src/test/test-shutdown-policy3/log.expect     |  59 ++++++++++
 src/test/test-shutdown-policy3/manager_status |   1 +
 src/test/test-shutdown-policy3/service_config |   4 +
 src/test/test-shutdown-policy4/cmdlist        |   4 +
 src/test/test-shutdown-policy4/datacenter.cfg |   5 +
 .../test-shutdown-policy4/hardware_status     |   5 +
 src/test/test-shutdown-policy4/log.expect     |  55 +++++++++
 src/test/test-shutdown-policy4/manager_status |   1 +
 src/test/test-shutdown-policy4/service_config |   4 +
 src/test/test-shutdown-policy5/cmdlist        |   4 +
 src/test/test-shutdown-policy5/datacenter.cfg |   5 +
 .../test-shutdown-policy5/hardware_status     |   5 +
 src/test/test-shutdown-policy5/log.expect     |  58 ++++++++++
 src/test/test-shutdown-policy5/manager_status |   1 +
 src/test/test-shutdown-policy5/service_config |   4 +
 27 files changed, 440 insertions(+), 13 deletions(-)
 create mode 100644 src/test/test-shutdown-policy-migrate-fail1/cmdlist
 create mode 100644 src/test/test-shutdown-policy-migrate-fail1/datacenter.cfg
 create mode 100644 src/test/test-shutdown-policy-migrate-fail1/hardware_status
 create mode 100644 src/test/test-shutdown-policy-migrate-fail1/log.expect
 create mode 100644 src/test/test-shutdown-policy-migrate-fail1/manager_status
 create mode 100644 src/test/test-shutdown-policy-migrate-fail1/service_config
 create mode 100644 src/test/test-shutdown-policy3/cmdlist
 create mode 100644 src/test/test-shutdown-policy3/datacenter.cfg
 create mode 100644 src/test/test-shutdown-policy3/hardware_status
 create mode 100644 src/test/test-shutdown-policy3/log.expect
 create mode 100644 src/test/test-shutdown-policy3/manager_status
 create mode 100644 src/test/test-shutdown-policy3/service_config
 create mode 100644 src/test/test-shutdown-policy4/cmdlist
 create mode 100644 src/test/test-shutdown-policy4/datacenter.cfg
 create mode 100644 src/test/test-shutdown-policy4/hardware_status
 create mode 100644 src/test/test-shutdown-policy4/log.expect
 create mode 100644 src/test/test-shutdown-policy4/manager_status
 create mode 100644 src/test/test-shutdown-policy4/service_config
 create mode 100644 src/test/test-shutdown-policy5/cmdlist
 create mode 100644 src/test/test-shutdown-policy5/datacenter.cfg
 create mode 100644 src/test/test-shutdown-policy5/hardware_status
 create mode 100644 src/test/test-shutdown-policy5/log.expect
 create mode 100644 src/test/test-shutdown-policy5/manager_status
 create mode 100644 src/test/test-shutdown-policy5/service_config

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index b5ef8b8..98466a2 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -16,6 +16,7 @@ use PVE::HA::Resources;
 my $valid_states = {
     wait_for_agent_lock => "waiting for agent lock",
     active => "got agent_lock",
+    maintenance => "going into maintenance",
     lost_agent_lock => "lost agent_lock",
 };
 
@@ -61,18 +62,27 @@ sub shutdown_request {
     }
 
     my $freeze_all;
+    my $maintenance;
     if ($shutdown_policy eq 'conditional') {
 	$freeze_all = $reboot;
     } elsif ($shutdown_policy eq 'freeze') {
 	$freeze_all = 1;
     } elsif ($shutdown_policy eq 'failover') {
 	$freeze_all = 0;
+    } elsif ($shutdown_policy eq 'migrate') {
+	$maintenance = 1;
     } else {
 	$haenv->log('err', "unknown shutdown policy '$shutdown_policy', fall back to conditional");
 	$freeze_all = $reboot;
     }
 
-    if ($shutdown) {
+    if ($maintenance) {
+	# we get marked as unavailable by the manager, then all services will
+	# be migrated away, we'll still have the same "can we exit" clause as
+	# a normal shutdown -> no running service on this node
+	# FIXME: after X minutes, add shutdown command for remaining services,
+	# e.g., if they have no alternative node???
+    } elsif ($shutdown) {
 	# *always* queue stop jobs for all services if the node shuts down,
 	# independent if it's a reboot or a poweroff, else we may corrupt
 	# services or hinder node shutdown
@@ -89,7 +99,10 @@ sub shutdown_request {
 
     if ($shutdown) {
 	my $shutdown_type = $reboot ? 'reboot' : 'shutdown';
-	if ($freeze_all) {
+	if ($maintenance) {
+	    $haenv->log('info', "$shutdown_type LRM, doing maintenance, removing this node from active list");
+	    $self->{mode} = 'maintenance';
+	} elsif ($freeze_all) {
 	    $haenv->log('info', "$shutdown_type LRM, stop and freeze all services");
 	    $self->{mode} = 'restart';
 	} else {
@@ -101,7 +114,7 @@ sub shutdown_request {
 	$self->{mode} = 'restart';
     }
 
-    $self->{shutdown_request} = 1;
+    $self->{shutdown_request} = $haenv->get_time();
 
     eval { $self->update_lrm_status() or die "not quorate?\n"; };
     if (my $err = $@) {
@@ -300,6 +313,16 @@ sub work {
 	    $self->set_local_status({ state => 'lost_agent_lock'});
 	} elsif (!$self->get_protected_ha_agent_lock()) {
 	    $self->set_local_status({ state => 'lost_agent_lock'});
+	} elsif ($self->{mode} eq 'maintenance') {
+	    $self->set_local_status({ state => 'maintenance'});
+	}
+    } elsif ($state eq 'maintenance') {
+
+	if ($fence_request) {
+	    $haenv->log('err', "node needs to be fenced during maintenance mode - releasing agent_lock\n");
+	    $self->set_local_status({ state => 'lost_agent_lock'});
+	} elsif (!$self->get_protected_ha_agent_lock()) {
+	    $self->set_local_status({ state => 'lost_agent_lock'});
 	}
     }
 
@@ -432,6 +455,38 @@ sub work {
 
 	$haenv->sleep(5);
 
+    } elsif ($state eq 'maintenance') {
+
+	my $startime = $haenv->get_time();
+	return if !$self->update_service_status();
+
+	# wait until all active services moved away
+	my $service_count = $self->active_service_count();
+
+	my $exit_lrm = 0;
+
+	if ($self->{shutdown_request}) {
+	    if ($service_count == 0 && $self->run_workers() == 0) {
+		if ($self->{ha_agent_wd}) {
+		    $haenv->watchdog_close($self->{ha_agent_wd});
+		    delete $self->{ha_agent_wd};
+		}
+
+		$exit_lrm = 1;
+
+		# restart with no or frozen services, release the lock
+		$haenv->release_ha_agent_lock();
+	    }
+	}
+
+	$self->manage_resources() if !$exit_lrm;
+
+	$self->update_lrm_status();
+
+	return 0 if $exit_lrm;
+
+	$haenv->sleep_until($startime + 5);
+
     } else {
 
 	die "got unexpected status '$state'\n";
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 3d09433..1f14754 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -369,15 +369,16 @@ sub manage {
 
     my ($haenv, $ms, $ns, $ss) = ($self->{haenv}, $self->{ms}, $self->{ns}, $self->{ss});
 
-    $ns->update($haenv->get_node_info());
-
-    if (!$ns->node_is_online($haenv->nodename())) {
-	$haenv->log('info', "master seems offline");
-	return;
-    }
-
+    my ($node_info) = $haenv->get_node_info();
     my ($lrm_results, $lrm_modes) = $self->read_lrm_status();
 
+    $ns->update($node_info, $lrm_modes);
+
+    if (!$ns->node_is_operational($haenv->nodename())) {
+	$haenv->log('info', "master seems offline");
+	return;
+    }
+
     my $sc = $haenv->read_service_config();
 
     $self->{groups} = $haenv->read_group_config(); # update
@@ -638,7 +639,9 @@ sub next_state_started {
 	if ($ns->node_is_offline_delayed($sd->{node})) {
 	    &$change_service_state($self, $sid, 'fence');
 	}
-	return;
+	if ($ns->get_node_state($sd->{node}) ne 'maintenance') {
+	    return;
+	}
     }
 
     if ($cd->{state} eq 'disabled' || $cd->{state} eq 'stopped') {
diff --git a/src/PVE/HA/NodeStatus.pm b/src/PVE/HA/NodeStatus.pm
index 8784110..9d58fa4 100644
--- a/src/PVE/HA/NodeStatus.pm
+++ b/src/PVE/HA/NodeStatus.pm
@@ -24,6 +24,7 @@ sub new {
 # possible node state:
 my $valid_node_states = {
     online => "node online and member of quorate partition",
+    maintenance => "node is a member of quorate partition but currently not able to do work",
     unknown => "not member of quorate partition, but possibly still running",
     fence => "node needs to be fenced",
     gone => "node vanished from cluster members list, possibly deleted"
@@ -38,6 +39,11 @@ sub get_node_state {
     return $self->{status}->{$node};
 }
 
+sub node_is_operational {
+    my ($self, $node) = @_;
+    return $self->node_is_online($node) || $self->get_node_state($node) eq 'maintenance';
+}
+
 sub node_is_online {
     my ($self, $node) = @_;
 
@@ -117,12 +123,13 @@ my $set_node_state = sub {
 };
 
 sub update {
-    my ($self, $node_info) = @_;
+    my ($self, $node_info, $lrm_modes) = @_;
 
     my $haenv = $self->{haenv};
 
     foreach my $node (sort keys %$node_info) {
 	my $d = $node_info->{$node};
+	my $lrm_mode = $lrm_modes->{$node} // 'unknown';
 	next if !$d->{online};
 
 	# record last time the node was online (required to implement fence delay)
@@ -131,11 +138,19 @@ sub update {
 	my $state = $self->get_node_state($node);
 
 	if ($state eq 'online') {
+	    if ($lrm_mode eq 'maintenance') {
+		#$haenv->log('info', "update node state maintenance");
+		$set_node_state->($self, $node, 'maintenance');
+	    }
 	    # &$set_node_state($self, $node, 'online');
 	} elsif ($state eq 'unknown' || $state eq 'gone') {
 	    &$set_node_state($self, $node, 'online');
 	} elsif ($state eq 'fence') {
 	    # do nothing, wait until fenced
+	} elsif ($state eq 'maintenance') {
+	    if ($lrm_mode ne 'maintenance') {
+		$set_node_state->($self, $node, 'online');
+	    }
 	} else {
 	    die "detected unknown node state '$state";
 	}
@@ -149,7 +164,7 @@ sub update {
 
 	# node is not inside quorate partition, possibly not active
 
-	if ($state eq 'online') {
+	if ($state eq 'online' || $state eq 'maintenance') {
 	    &$set_node_state($self, $node, 'unknown');
 	} elsif ($state eq 'unknown') {
 
diff --git a/src/test/test-shutdown-policy-migrate-fail1/cmdlist b/src/test/test-shutdown-policy-migrate-fail1/cmdlist
new file mode 100644
index 0000000..8558351
--- /dev/null
+++ b/src/test/test-shutdown-policy-migrate-fail1/cmdlist
@@ -0,0 +1,4 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "reboot node3" ]
+]
diff --git a/src/test/test-shutdown-policy-migrate-fail1/datacenter.cfg b/src/test/test-shutdown-policy-migrate-fail1/datacenter.cfg
new file mode 100644
index 0000000..de0bf81
--- /dev/null
+++ b/src/test/test-shutdown-policy-migrate-fail1/datacenter.cfg
@@ -0,0 +1,5 @@
+{
+    "ha": {
+        "shutdown_policy": "migrate"
+    }
+}
diff --git a/src/test/test-shutdown-policy-migrate-fail1/hardware_status b/src/test/test-shutdown-policy-migrate-fail1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-shutdown-policy-migrate-fail1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-shutdown-policy-migrate-fail1/log.expect b/src/test/test-shutdown-policy-migrate-fail1/log.expect
new file mode 100644
index 0000000..79664c7
--- /dev/null
+++ b/src/test/test-shutdown-policy-migrate-fail1/log.expect
@@ -0,0 +1,106 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'fa:109' on node 'node3'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service fa:109
+info     25    node3/lrm: service status fa:109 started
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute reboot node3
+info    120    node3/lrm: got shutdown request with shutdown policy 'migrate'
+info    120    node3/lrm: reboot LRM, doing maintenance, removing this node from active list
+info    120    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info    120    node1/crm: relocate service 'fa:109' to node 'node1'
+info    120    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node1)
+info    120    node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info    120    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    125    node3/lrm: status change active => maintenance
+err     125    node3/lrm: service fa:109 not moved (migration error)
+info    125    node3/lrm: service vm:103 - start migrate to node 'node1'
+info    125    node3/lrm: service vm:103 - end migrate to node 'node1'
+err     140    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    140    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    140    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    140    node1/crm: relocate service 'fa:109' to node 'node2'
+info    140    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+info    141    node1/lrm: got lock 'ha_agent_node1_lock'
+info    141    node1/lrm: status change wait_for_agent_lock => active
+info    141    node1/lrm: starting service vm:103
+info    141    node1/lrm: service status vm:103 started
+err     145    node3/lrm: service fa:109 not moved (migration error)
+err     160    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    160    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    160    node1/crm: relocate service 'fa:109' to node 'node2'
+info    160    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     165    node3/lrm: service fa:109 not moved (migration error)
+err     180    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    180    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    180    node1/crm: relocate service 'fa:109' to node 'node2'
+info    180    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     185    node3/lrm: service fa:109 not moved (migration error)
+err     200    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    200    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    200    node1/crm: relocate service 'fa:109' to node 'node2'
+info    200    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     205    node3/lrm: service fa:109 not moved (migration error)
+err     220    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    220    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    220    node1/crm: relocate service 'fa:109' to node 'node2'
+info    220    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     225    node3/lrm: service fa:109 not moved (migration error)
+err     240    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    240    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    240    node1/crm: relocate service 'fa:109' to node 'node2'
+info    240    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     245    node3/lrm: service fa:109 not moved (migration error)
+err     260    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    260    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    260    node1/crm: relocate service 'fa:109' to node 'node2'
+info    260    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     265    node3/lrm: service fa:109 not moved (migration error)
+err     280    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    280    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    280    node1/crm: relocate service 'fa:109' to node 'node2'
+info    280    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+err     285    node3/lrm: service fa:109 not moved (migration error)
+err     300    node1/crm: service 'fa:109' - migration failed (exit code 1)
+info    300    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node3)
+info    300    node1/crm: relocate service 'fa:109' to node 'node2'
+info    300    node1/crm: service 'fa:109': state changed from 'started' to 'relocate'  (node = node3, target = node2)
+info    305    node3/lrm: service fa:109 - start relocate to node 'node2'
+info    305    node3/lrm: stopping service fa:109 (relocate)
+info    305    node3/lrm: service status fa:109 stopped
+info    305    node3/lrm: service fa:109 - end relocate to node 'node2'
+info    320    node1/crm: service 'fa:109': state changed from 'relocate' to 'started'  (node = node2)
+info    323    node2/lrm: got lock 'ha_agent_node2_lock'
+info    323    node2/lrm: status change wait_for_agent_lock => active
+info    323    node2/lrm: starting service fa:109
+info    323    node2/lrm: service status fa:109 started
+info    326    node3/lrm: exit (loop end)
+info    326       reboot: execute crm node3 stop
+info    325    node3/crm: server received shutdown request
+info    345    node3/crm: exit (loop end)
+info    345       reboot: execute power node3 off
+info    345       reboot: execute power node3 on
+info    345    node3/crm: status change startup => wait_for_quorum
+info    340    node3/lrm: status change startup => wait_for_agent_lock
+info    360    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info    364    node3/crm: status change wait_for_quorum => slave
+info    720     hardware: exit simulation - done
diff --git a/src/test/test-shutdown-policy-migrate-fail1/manager_status b/src/test/test-shutdown-policy-migrate-fail1/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-shutdown-policy-migrate-fail1/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-shutdown-policy-migrate-fail1/service_config b/src/test/test-shutdown-policy-migrate-fail1/service_config
new file mode 100644
index 0000000..458bb5e
--- /dev/null
+++ b/src/test/test-shutdown-policy-migrate-fail1/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:103": { "node": "node3", "state": "enabled" },
+    "fa:109": { "node": "node3", "state": "enabled" }
+}
diff --git a/src/test/test-shutdown-policy3/cmdlist b/src/test/test-shutdown-policy3/cmdlist
new file mode 100644
index 0000000..8558351
--- /dev/null
+++ b/src/test/test-shutdown-policy3/cmdlist
@@ -0,0 +1,4 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "reboot node3" ]
+]
diff --git a/src/test/test-shutdown-policy3/datacenter.cfg b/src/test/test-shutdown-policy3/datacenter.cfg
new file mode 100644
index 0000000..de0bf81
--- /dev/null
+++ b/src/test/test-shutdown-policy3/datacenter.cfg
@@ -0,0 +1,5 @@
+{
+    "ha": {
+        "shutdown_policy": "migrate"
+    }
+}
diff --git a/src/test/test-shutdown-policy3/hardware_status b/src/test/test-shutdown-policy3/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-shutdown-policy3/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-shutdown-policy3/log.expect b/src/test/test-shutdown-policy3/log.expect
new file mode 100644
index 0000000..6ecf211
--- /dev/null
+++ b/src/test/test-shutdown-policy3/log.expect
@@ -0,0 +1,59 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'ct:102' on node 'node3'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service ct:102
+info     25    node3/lrm: service status ct:102 started
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute reboot node3
+info    120    node3/lrm: got shutdown request with shutdown policy 'migrate'
+info    120    node3/lrm: reboot LRM, doing maintenance, removing this node from active list
+info    120    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info    120    node1/crm: relocate service 'ct:102' to node 'node1'
+info    120    node1/crm: service 'ct:102': state changed from 'started' to 'relocate'  (node = node3, target = node1)
+info    120    node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info    120    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    125    node3/lrm: status change active => maintenance
+info    125    node3/lrm: service ct:102 - start relocate to node 'node1'
+info    125    node3/lrm: stopping service ct:102 (relocate)
+info    125    node3/lrm: service status ct:102 stopped
+info    125    node3/lrm: service ct:102 - end relocate to node 'node1'
+info    125    node3/lrm: service vm:103 - start migrate to node 'node1'
+info    125    node3/lrm: service vm:103 - end migrate to node 'node1'
+info    140    node1/crm: service 'ct:102': state changed from 'relocate' to 'started'  (node = node1)
+info    140    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    141    node1/lrm: got lock 'ha_agent_node1_lock'
+info    141    node1/lrm: status change wait_for_agent_lock => active
+info    141    node1/lrm: starting service ct:102
+info    141    node1/lrm: service status ct:102 started
+info    141    node1/lrm: starting service vm:103
+info    141    node1/lrm: service status vm:103 started
+info    146    node3/lrm: exit (loop end)
+info    146       reboot: execute crm node3 stop
+info    145    node3/crm: server received shutdown request
+info    165    node3/crm: exit (loop end)
+info    165       reboot: execute power node3 off
+info    165       reboot: execute power node3 on
+info    165    node3/crm: status change startup => wait_for_quorum
+info    160    node3/lrm: status change startup => wait_for_agent_lock
+info    180    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info    184    node3/crm: status change wait_for_quorum => slave
+info    720     hardware: exit simulation - done
diff --git a/src/test/test-shutdown-policy3/manager_status b/src/test/test-shutdown-policy3/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-shutdown-policy3/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-shutdown-policy3/service_config b/src/test/test-shutdown-policy3/service_config
new file mode 100644
index 0000000..8ee94b5
--- /dev/null
+++ b/src/test/test-shutdown-policy3/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:103": { "node": "node3", "state": "enabled" },
+    "ct:102": { "node": "node3", "state": "enabled" }
+}
diff --git a/src/test/test-shutdown-policy4/cmdlist b/src/test/test-shutdown-policy4/cmdlist
new file mode 100644
index 0000000..a86b9e2
--- /dev/null
+++ b/src/test/test-shutdown-policy4/cmdlist
@@ -0,0 +1,4 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "shutdown node3" ]
+]
diff --git a/src/test/test-shutdown-policy4/datacenter.cfg b/src/test/test-shutdown-policy4/datacenter.cfg
new file mode 100644
index 0000000..de0bf81
--- /dev/null
+++ b/src/test/test-shutdown-policy4/datacenter.cfg
@@ -0,0 +1,5 @@
+{
+    "ha": {
+        "shutdown_policy": "migrate"
+    }
+}
diff --git a/src/test/test-shutdown-policy4/hardware_status b/src/test/test-shutdown-policy4/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-shutdown-policy4/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-shutdown-policy4/log.expect b/src/test/test-shutdown-policy4/log.expect
new file mode 100644
index 0000000..2e31059
--- /dev/null
+++ b/src/test/test-shutdown-policy4/log.expect
@@ -0,0 +1,55 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'ct:102' on node 'node3'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service ct:102
+info     25    node3/lrm: service status ct:102 started
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute shutdown node3
+info    120    node3/lrm: got shutdown request with shutdown policy 'migrate'
+info    120    node3/lrm: shutdown LRM, doing maintenance, removing this node from active list
+info    120    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info    120    node1/crm: relocate service 'ct:102' to node 'node1'
+info    120    node1/crm: service 'ct:102': state changed from 'started' to 'relocate'  (node = node3, target = node1)
+info    120    node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info    120    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    125    node3/lrm: status change active => maintenance
+info    125    node3/lrm: service ct:102 - start relocate to node 'node1'
+info    125    node3/lrm: stopping service ct:102 (relocate)
+info    125    node3/lrm: service status ct:102 stopped
+info    125    node3/lrm: service ct:102 - end relocate to node 'node1'
+info    125    node3/lrm: service vm:103 - start migrate to node 'node1'
+info    125    node3/lrm: service vm:103 - end migrate to node 'node1'
+info    140    node1/crm: service 'ct:102': state changed from 'relocate' to 'started'  (node = node1)
+info    140    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    141    node1/lrm: got lock 'ha_agent_node1_lock'
+info    141    node1/lrm: status change wait_for_agent_lock => active
+info    141    node1/lrm: starting service ct:102
+info    141    node1/lrm: service status ct:102 started
+info    141    node1/lrm: starting service vm:103
+info    141    node1/lrm: service status vm:103 started
+info    146    node3/lrm: exit (loop end)
+info    146     shutdown: execute crm node3 stop
+info    145    node3/crm: server received shutdown request
+info    165    node3/crm: exit (loop end)
+info    165     shutdown: execute power node3 off
+info    180    node1/crm: node 'node3': state changed from 'maintenance' => 'unknown'
+info    720     hardware: exit simulation - done
diff --git a/src/test/test-shutdown-policy4/manager_status b/src/test/test-shutdown-policy4/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-shutdown-policy4/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-shutdown-policy4/service_config b/src/test/test-shutdown-policy4/service_config
new file mode 100644
index 0000000..8ee94b5
--- /dev/null
+++ b/src/test/test-shutdown-policy4/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:103": { "node": "node3", "state": "enabled" },
+    "ct:102": { "node": "node3", "state": "enabled" }
+}
diff --git a/src/test/test-shutdown-policy5/cmdlist b/src/test/test-shutdown-policy5/cmdlist
new file mode 100644
index 0000000..e84297f
--- /dev/null
+++ b/src/test/test-shutdown-policy5/cmdlist
@@ -0,0 +1,4 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "shutdown node1" ]
+]
diff --git a/src/test/test-shutdown-policy5/datacenter.cfg b/src/test/test-shutdown-policy5/datacenter.cfg
new file mode 100644
index 0000000..de0bf81
--- /dev/null
+++ b/src/test/test-shutdown-policy5/datacenter.cfg
@@ -0,0 +1,5 @@
+{
+    "ha": {
+        "shutdown_policy": "migrate"
+    }
+}
diff --git a/src/test/test-shutdown-policy5/hardware_status b/src/test/test-shutdown-policy5/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-shutdown-policy5/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-shutdown-policy5/log.expect b/src/test/test-shutdown-policy5/log.expect
new file mode 100644
index 0000000..15f67c2
--- /dev/null
+++ b/src/test/test-shutdown-policy5/log.expect
@@ -0,0 +1,58 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'ct:102' on node 'node1'
+info     20    node1/crm: adding new service 'vm:103' on node 'node1'
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service ct:102
+info     21    node1/lrm: service status ct:102 started
+info     21    node1/lrm: starting service vm:103
+info     21    node1/lrm: service status vm:103 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info    120      cmdlist: execute shutdown node1
+info    120    node1/lrm: got shutdown request with shutdown policy 'migrate'
+info    120    node1/lrm: shutdown LRM, doing maintenance, removing this node from active list
+info    120    node1/crm: node 'node1': state changed from 'online' => 'maintenance'
+info    120    node1/crm: relocate service 'ct:102' to node 'node2'
+info    120    node1/crm: service 'ct:102': state changed from 'started' to 'relocate'  (node = node1, target = node2)
+info    120    node1/crm: migrate service 'vm:103' to node 'node2' (running)
+info    120    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node1, target = node2)
+info    121    node1/lrm: status change active => maintenance
+info    121    node1/lrm: service ct:102 - start relocate to node 'node2'
+info    121    node1/lrm: stopping service ct:102 (relocate)
+info    121    node1/lrm: service status ct:102 stopped
+info    121    node1/lrm: service ct:102 - end relocate to node 'node2'
+info    121    node1/lrm: service vm:103 - start migrate to node 'node2'
+info    121    node1/lrm: service vm:103 - end migrate to node 'node2'
+info    140    node1/crm: service 'ct:102': state changed from 'relocate' to 'started'  (node = node2)
+info    140    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node2)
+info    142    node1/lrm: exit (loop end)
+info    142     shutdown: execute crm node1 stop
+info    141    node1/crm: server received shutdown request
+info    143    node2/lrm: got lock 'ha_agent_node2_lock'
+info    143    node2/lrm: status change wait_for_agent_lock => active
+info    143    node2/lrm: starting service ct:102
+info    143    node2/lrm: service status ct:102 started
+info    143    node2/lrm: starting service vm:103
+info    143    node2/lrm: service status vm:103 started
+info    160    node1/crm: voluntary release CRM lock
+info    161    node1/crm: exit (loop end)
+info    161     shutdown: execute power node1 off
+info    161    node2/crm: got lock 'ha_manager_lock'
+info    161    node2/crm: status change slave => master
+info    161    node2/crm: node 'node1': state changed from 'maintenance' => 'unknown'
+info    720     hardware: exit simulation - done
diff --git a/src/test/test-shutdown-policy5/manager_status b/src/test/test-shutdown-policy5/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-shutdown-policy5/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-shutdown-policy5/service_config b/src/test/test-shutdown-policy5/service_config
new file mode 100644
index 0000000..d37e5b3
--- /dev/null
+++ b/src/test/test-shutdown-policy5/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:103": { "node": "node1", "state": "enabled" },
+    "ct:102": { "node": "node1", "state": "enabled" }
+}
-- 
2.20.1




