[pve-devel] [PATCH ha-manager 2/4] allow use of external fencing devices
Thomas Lamprecht
t.lamprecht at proxmox.com
Wed Mar 27 17:42:04 CET 2019
A node can now be fenced with the use of external hardware fence
devices.
Those devices can be configured in /etc/pve/ha/fence.cfg.
Additionally, the 'fencing' option in the datacenter configuration file
must be set to either 'hardware' or 'both', else configured devices
will *not* be used.
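For illustration, a minimal configuration could look like the following
(the fence.cfg syntax follows man dlm.conf, as in the test configs added
below; agent, IP and plug values are just placeholders, and the
datacenter.cfg line assumes the usual 'key: value' syntax):

    # /etc/pve/datacenter.cfg
    fencing: hardware

    # /etc/pve/ha/fence.cfg
    device  virt fence_virt ip="127.0.0.1"
    connect virt node=node1 plug=100
    connect virt node=node2 plug=101
    connect virt node=node3 plug=102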
If 'hardware' is selected as mode, a valid device config *must* be
present; the fencing will not be marked as successful even if the CRM
could theoretically acquire the lock from the failed node!
This is done because some setups may require a HW fence agent to cut
the node off completely, and there a watchdog which merely resets the
node could be dangerous.
We always *must* acquire the lock before we can mark the failed node
as fenced, place it into the 'unknown' state and recover its
services.
The CRM bails out in case of a lost manager lock event, where
$manager->cleanup() gets called.
There we kill all remaining open fence processes, if any,
and reset the fence status.
The current master's manager class processes the running fencing
jobs, i.e. it picks up finished fence workers and evaluates their
results.
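Condensed, the new hardware path in fence_node() works as sketched below
(names taken from the NodeStatus.pm hunk in this patch; logging, emails
and error messages omitted, is_node_fenced() return value semantics as
implied by the comments there):

    my ($success, $hw_ok) = (0, 0);
    my $mode = $haenv->fencing_mode();   # 'watchdog', 'hardware' or 'both'

    if ($mode eq 'hardware' || $mode eq 'both') {
        $hw_ok = $fencer->is_node_fenced($node); # < 0: bad/missing config, > 0: fenced

        # hardware-only mode has no watchdog fallback
        return 0 if $hw_ok < 0 && $mode eq 'hardware';

        if ($hw_ok > 0) {
            # stealing the lock is only allowed after successful fencing
            $haenv->release_ha_agent_lock($node);
        } else {
            # start new fence workers and/or pick up finished ones
            $fencer->run_fence_jobs($node);
        }
    }

    # the failed node's agent lock is always required before recovery
    $success = $haenv->get_ha_agent_lock($node)
        if $hw_ok || $mode ne 'hardware';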
Regression tests with faked virtual HW fence devices are now also
possible.
The current virtual devices always succeed; this will be changed
in a future patch to allow testing of more (dangerous) corner cases.
Devices can be configured in the testdir/fence.cfg file and follow
exactly the same format as the real ones (see man dlm.conf).
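As a concrete example, with the fence.cfg of test-hw-fence1 below the
simulated CRM combines the device arguments with the failed node's
connect arguments into the agent invocation seen in log.expect:

    device  virt fence_virt ip="127.0.0.1"
    connect virt node=node3 plug=102
      => fence_virt --ip=127.0.0.1 --plug=102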
---
src/PVE/HA/Manager.pm | 4 +-
src/PVE/HA/NodeStatus.pm | 54 +++++++++++-
src/test/test-hw-fence1/README | 1 +
src/test/test-hw-fence1/cmdlist | 4 +
src/test/test-hw-fence1/fence.cfg | 6 ++
src/test/test-hw-fence1/hardware_status | 5 ++
src/test/test-hw-fence1/log.expect | 53 ++++++++++++
src/test/test-hw-fence1/manager_status | 1 +
src/test/test-hw-fence1/service_config | 5 ++
src/test/test-hw-fence2/README | 3 +
src/test/test-hw-fence2/cmdlist | 5 ++
src/test/test-hw-fence2/fence.cfg | 8 ++
src/test/test-hw-fence2/hardware_status | 7 ++
src/test/test-hw-fence2/log.expect | 110 ++++++++++++++++++++++++
src/test/test-hw-fence2/manager_status | 1 +
src/test/test-hw-fence2/service_config | 7 ++
src/test/test-hw-fence3/README | 5 ++
src/test/test-hw-fence3/cmdlist | 4 +
src/test/test-hw-fence3/fence.cfg | 17 ++++
src/test/test-hw-fence3/hardware_status | 5 ++
src/test/test-hw-fence3/log.expect | 57 ++++++++++++
src/test/test-hw-fence3/manager_status | 1 +
src/test/test-hw-fence3/service_config | 5 ++
23 files changed, 366 insertions(+), 2 deletions(-)
create mode 100644 src/test/test-hw-fence1/README
create mode 100644 src/test/test-hw-fence1/cmdlist
create mode 100644 src/test/test-hw-fence1/fence.cfg
create mode 100644 src/test/test-hw-fence1/hardware_status
create mode 100644 src/test/test-hw-fence1/log.expect
create mode 100644 src/test/test-hw-fence1/manager_status
create mode 100644 src/test/test-hw-fence1/service_config
create mode 100644 src/test/test-hw-fence2/README
create mode 100644 src/test/test-hw-fence2/cmdlist
create mode 100644 src/test/test-hw-fence2/fence.cfg
create mode 100644 src/test/test-hw-fence2/hardware_status
create mode 100644 src/test/test-hw-fence2/log.expect
create mode 100644 src/test/test-hw-fence2/manager_status
create mode 100644 src/test/test-hw-fence2/service_config
create mode 100644 src/test/test-hw-fence3/README
create mode 100644 src/test/test-hw-fence3/cmdlist
create mode 100644 src/test/test-hw-fence3/fence.cfg
create mode 100644 src/test/test-hw-fence3/hardware_status
create mode 100644 src/test/test-hw-fence3/log.expect
create mode 100644 src/test/test-hw-fence3/manager_status
create mode 100644 src/test/test-hw-fence3/service_config
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index a6c9b8e..177dfc2 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -7,6 +7,7 @@ use Digest::MD5 qw(md5_base64);
use PVE::Tools;
use PVE::HA::Tools ':exit_codes';
use PVE::HA::NodeStatus;
+use PVE::HA::Fence;
sub new {
my ($this, $haenv) = @_;
@@ -32,7 +33,8 @@ sub new {
sub cleanup {
my ($self) = @_;
- # todo: ?
+ # reset pending fence jobs and node states
+ $self->{ns}->cleanup();
}
sub flush_master_status {
diff --git a/src/PVE/HA/NodeStatus.pm b/src/PVE/HA/NodeStatus.pm
index 940d903..ca13f2f 100644
--- a/src/PVE/HA/NodeStatus.pm
+++ b/src/PVE/HA/NodeStatus.pm
@@ -2,6 +2,7 @@ package PVE::HA::NodeStatus;
use strict;
use warnings;
+use PVE::HA::Fence;
use JSON;
@@ -12,8 +13,11 @@ sub new {
my $class = ref($this) || $this;
+ my $fencer = PVE::HA::Fence->new($haenv);
+
my $self = bless {
haenv => $haenv,
+ fencer => $fencer,
status => $status,
last_online => {},
}, $class;
@@ -206,6 +210,7 @@ sub fence_node {
my ($self, $node) = @_;
my $haenv = $self->{haenv};
+ my $fencer = $self->{fencer};
my $state = $self->get_node_state($node);
@@ -215,16 +220,63 @@ sub fence_node {
&$send_fence_state_email($self, 'FENCE', $msg, $node);
}
- my $success = $haenv->get_ha_agent_lock($node);
+ my ($success, $hw_fence_success) = (0, 0);
+
+ my $fencing_mode = $haenv->fencing_mode();
+
+ if ($fencing_mode eq 'hardware' || $fencing_mode eq 'both') {
+
+ $hw_fence_success = $fencer->is_node_fenced($node);
+
+ # bad fence.cfg or no devices and only hardware fencing configured
+ if ($hw_fence_success < 0 && $fencing_mode eq 'hardware') {
+ $haenv->log('err', "Fencing of node '$node' failed and needs " .
+ "manual intervention!");
+ return 0;
+ }
+
+ if ($hw_fence_success > 0) {
+ # we fenced the node, now we're allowed to "steal" its lock
+ $haenv->log('notice', "fencing of node '$node' succeeded, " .
+ "trying to get its agent lock");
+ # This may only be done after successfully fencing node!
+ $haenv->release_ha_agent_lock($node);
+
+ } else {
+
+ # start and process fencing
+ $fencer->run_fence_jobs($node);
+
+ }
+ }
+
+ # we *always* need the failed nodes lock, it secures that we are allowed to
+ # recover its services and prevents races, e.g. if it's restarting.
+ if ($hw_fence_success || $fencing_mode ne 'hardware' ) {
+ $success = $haenv->get_ha_agent_lock($node);
+ }
if ($success) {
my $msg = "fencing: acknowledged - got agent lock for node '$node'";
$haenv->log("info", $msg);
&$set_node_state($self, $node, 'unknown');
&$send_fence_state_email($self, 'SUCCEED', $msg, $node);
+ $fencer->kill_and_cleanup_jobs($node) if ($fencing_mode ne 'watchdog');
}
return $success;
}
+sub cleanup {
+ my ($self) = @_;
+
+ my $haenv = $self->{haenv};
+ my $fencer = $self->{fencer};
+
+ if ($fencer->has_fencing_job($haenv->nodename())) {
+ $haenv->log('notice', "bailing out from running fence jobs");
+ $fencer->kill_and_cleanup_jobs();
+ }
+}
+
1;
diff --git a/src/test/test-hw-fence1/README b/src/test/test-hw-fence1/README
new file mode 100644
index 0000000..d0dea4b
--- /dev/null
+++ b/src/test/test-hw-fence1/README
@@ -0,0 +1 @@
+Test failover after single node network failure with HW fence devices.
diff --git a/src/test/test-hw-fence1/cmdlist b/src/test/test-hw-fence1/cmdlist
new file mode 100644
index 0000000..eee0e40
--- /dev/null
+++ b/src/test/test-hw-fence1/cmdlist
@@ -0,0 +1,4 @@
+[
+ [ "power node1 on", "power node2 on", "power node3 on"],
+ [ "network node3 off" ]
+]
diff --git a/src/test/test-hw-fence1/fence.cfg b/src/test/test-hw-fence1/fence.cfg
new file mode 100644
index 0000000..0bbe096
--- /dev/null
+++ b/src/test/test-hw-fence1/fence.cfg
@@ -0,0 +1,6 @@
+# see man dlm.conf
+device virt fence_virt ip="127.0.0.1"
+connect virt node=node1 plug=100
+connect virt node=node2 plug=101
+connect virt node=node3 plug=102
+
diff --git a/src/test/test-hw-fence1/hardware_status b/src/test/test-hw-fence1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-hw-fence1/hardware_status
@@ -0,0 +1,5 @@
+{
+ "node1": { "power": "off", "network": "off" },
+ "node2": { "power": "off", "network": "off" },
+ "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-hw-fence1/log.expect b/src/test/test-hw-fence1/log.expect
new file mode 100644
index 0000000..8cd8a40
--- /dev/null
+++ b/src/test/test-hw-fence1/log.expect
@@ -0,0 +1,53 @@
+info 0 hardware: starting simulation
+info 20 cmdlist: execute power node1 on
+info 20 node1/crm: status change startup => wait_for_quorum
+info 20 node1/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node2 on
+info 20 node2/crm: status change startup => wait_for_quorum
+info 20 node2/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node3 on
+info 20 node3/crm: status change startup => wait_for_quorum
+info 20 node3/lrm: status change startup => wait_for_agent_lock
+info 20 node1/crm: got lock 'ha_manager_lock'
+info 20 node1/crm: status change wait_for_quorum => master
+info 20 node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info 20 node1/crm: adding new service 'vm:101' on node 'node1'
+info 20 node1/crm: adding new service 'vm:102' on node 'node2'
+info 20 node1/crm: adding new service 'vm:103' on node 'node3'
+info 21 node1/lrm: got lock 'ha_agent_node1_lock'
+info 21 node1/lrm: status change wait_for_agent_lock => active
+info 21 node1/lrm: starting service vm:101
+info 21 node1/lrm: service status vm:101 started
+info 22 node2/crm: status change wait_for_quorum => slave
+info 23 node2/lrm: got lock 'ha_agent_node2_lock'
+info 23 node2/lrm: status change wait_for_agent_lock => active
+info 24 node3/crm: status change wait_for_quorum => slave
+info 25 node3/lrm: got lock 'ha_agent_node3_lock'
+info 25 node3/lrm: status change wait_for_agent_lock => active
+info 25 node3/lrm: starting service vm:103
+info 25 node3/lrm: service status vm:103 started
+info 40 node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info 120 cmdlist: execute network node3 off
+info 120 node1/crm: node 'node3': state changed from 'online' => 'unknown'
+info 124 node3/crm: status change slave => wait_for_quorum
+info 125 node3/lrm: status change active => lost_agent_lock
+info 160 node1/crm: service 'vm:103': state changed from 'started' to 'fence'
+info 160 node1/crm: node 'node3': state changed from 'unknown' => 'fence'
+emai 160 node1/crm: FENCE: Try to fence node 'node3'
+noti 160 node1/crm: Start fencing node 'node3'
+noti 160 node1/crm: [fence 'node3'] execute cmd: fence_virt --ip=127.0.0.1 --plug=102
+info 160 fence_virt: execute power node3 off
+info 160 node3/crm: killed by poweroff
+info 160 node3/lrm: killed by poweroff
+noti 180 node1/crm: fencing of node 'node3' succeeded, trying to get its agent lock
+info 180 node1/crm: got lock 'ha_agent_node3_lock'
+info 180 node1/crm: fencing: acknowledged - got agent lock for node 'node3'
+info 180 node1/crm: node 'node3': state changed from 'fence' => 'unknown'
+emai 180 node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node3'
+info 180 node1/crm: recover service 'vm:103' from fenced node 'node3' to node 'node2'
+info 180 node1/crm: service 'vm:103': state changed from 'fence' to 'started' (node = node2)
+info 183 node2/lrm: starting service vm:103
+info 183 node2/lrm: service status vm:103 started
+info 720 hardware: exit simulation - done
diff --git a/src/test/test-hw-fence1/manager_status b/src/test/test-hw-fence1/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-hw-fence1/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-hw-fence1/service_config b/src/test/test-hw-fence1/service_config
new file mode 100644
index 0000000..70f11d6
--- /dev/null
+++ b/src/test/test-hw-fence1/service_config
@@ -0,0 +1,5 @@
+{
+ "vm:101": { "node": "node1", "state": "enabled" },
+ "vm:102": { "node": "node2" },
+ "vm:103": { "node": "node3", "state": "enabled" }
+}
diff --git a/src/test/test-hw-fence2/README b/src/test/test-hw-fence2/README
new file mode 100644
index 0000000..a3814ec
--- /dev/null
+++ b/src/test/test-hw-fence2/README
@@ -0,0 +1,3 @@
+Test HW fencing and failover after the network of two nodes fails.
+This tests whether the HW fence mechanism can cope with multiple nodes failing
+simultaneously, as long as there is still quorum.
diff --git a/src/test/test-hw-fence2/cmdlist b/src/test/test-hw-fence2/cmdlist
new file mode 100644
index 0000000..9a8bb59
--- /dev/null
+++ b/src/test/test-hw-fence2/cmdlist
@@ -0,0 +1,5 @@
+[
+ [ "power node1 on", "power node2 on", "power node3 on",
+ "power node4 on", "power node5 on"],
+ [ "network node3 off", "network node4 off" ]
+]
diff --git a/src/test/test-hw-fence2/fence.cfg b/src/test/test-hw-fence2/fence.cfg
new file mode 100644
index 0000000..aedcdc3
--- /dev/null
+++ b/src/test/test-hw-fence2/fence.cfg
@@ -0,0 +1,8 @@
+# see man dlm.conf
+device virt fence_virt ip="127.0.0.1"
+connect virt node=node1 plug=100
+connect virt node=node2 plug=101
+connect virt node=node3 plug=102
+connect virt node=node4 plug=103
+connect virt node=node5 plug=104
+
diff --git a/src/test/test-hw-fence2/hardware_status b/src/test/test-hw-fence2/hardware_status
new file mode 100644
index 0000000..7b8e961
--- /dev/null
+++ b/src/test/test-hw-fence2/hardware_status
@@ -0,0 +1,7 @@
+{
+ "node1": { "power": "off", "network": "off" },
+ "node2": { "power": "off", "network": "off" },
+ "node3": { "power": "off", "network": "off" },
+ "node4": { "power": "off", "network": "off" },
+ "node5": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-hw-fence2/log.expect b/src/test/test-hw-fence2/log.expect
new file mode 100644
index 0000000..0eadd78
--- /dev/null
+++ b/src/test/test-hw-fence2/log.expect
@@ -0,0 +1,110 @@
+info 0 hardware: starting simulation
+info 20 cmdlist: execute power node1 on
+info 20 node1/crm: status change startup => wait_for_quorum
+info 20 node1/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node2 on
+info 20 node2/crm: status change startup => wait_for_quorum
+info 20 node2/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node3 on
+info 20 node3/crm: status change startup => wait_for_quorum
+info 20 node3/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node4 on
+info 20 node4/crm: status change startup => wait_for_quorum
+info 20 node4/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node5 on
+info 20 node5/crm: status change startup => wait_for_quorum
+info 20 node5/lrm: status change startup => wait_for_agent_lock
+info 20 node1/crm: got lock 'ha_manager_lock'
+info 20 node1/crm: status change wait_for_quorum => master
+info 20 node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node4': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node5': state changed from 'unknown' => 'online'
+info 20 node1/crm: adding new service 'vm:101' on node 'node4'
+info 20 node1/crm: adding new service 'vm:102' on node 'node3'
+info 20 node1/crm: adding new service 'vm:103' on node 'node3'
+info 20 node1/crm: adding new service 'vm:104' on node 'node3'
+info 20 node1/crm: adding new service 'vm:105' on node 'node4'
+info 22 node2/crm: status change wait_for_quorum => slave
+info 24 node3/crm: status change wait_for_quorum => slave
+info 25 node3/lrm: got lock 'ha_agent_node3_lock'
+info 25 node3/lrm: status change wait_for_agent_lock => active
+info 25 node3/lrm: starting service vm:102
+info 25 node3/lrm: service status vm:102 started
+info 25 node3/lrm: starting service vm:103
+info 25 node3/lrm: service status vm:103 started
+info 25 node3/lrm: starting service vm:104
+info 25 node3/lrm: service status vm:104 started
+info 26 node4/crm: status change wait_for_quorum => slave
+info 27 node4/lrm: got lock 'ha_agent_node4_lock'
+info 27 node4/lrm: status change wait_for_agent_lock => active
+info 27 node4/lrm: starting service vm:101
+info 27 node4/lrm: service status vm:101 started
+info 27 node4/lrm: starting service vm:105
+info 27 node4/lrm: service status vm:105 started
+info 28 node5/crm: status change wait_for_quorum => slave
+info 120 cmdlist: execute network node3 off
+info 120 cmdlist: execute network node4 off
+info 120 node1/crm: node 'node3': state changed from 'online' => 'unknown'
+info 120 node1/crm: node 'node4': state changed from 'online' => 'unknown'
+info 124 node3/crm: status change slave => wait_for_quorum
+info 125 node3/lrm: status change active => lost_agent_lock
+info 126 node4/crm: status change slave => wait_for_quorum
+info 127 node4/lrm: status change active => lost_agent_lock
+info 160 node1/crm: service 'vm:101': state changed from 'started' to 'fence'
+info 160 node1/crm: service 'vm:102': state changed from 'started' to 'fence'
+info 160 node1/crm: service 'vm:103': state changed from 'started' to 'fence'
+info 160 node1/crm: service 'vm:104': state changed from 'started' to 'fence'
+info 160 node1/crm: service 'vm:105': state changed from 'started' to 'fence'
+info 160 node1/crm: node 'node4': state changed from 'unknown' => 'fence'
+emai 160 node1/crm: FENCE: Try to fence node 'node4'
+noti 160 node1/crm: Start fencing node 'node4'
+noti 160 node1/crm: [fence 'node4'] execute cmd: fence_virt --ip=127.0.0.1 --plug=103
+info 160 fence_virt: execute power node4 off
+info 160 node4/crm: killed by poweroff
+info 160 node4/lrm: killed by poweroff
+info 160 node1/crm: node 'node3': state changed from 'unknown' => 'fence'
+emai 160 node1/crm: FENCE: Try to fence node 'node3'
+noti 160 node1/crm: Start fencing node 'node3'
+noti 160 node1/crm: [fence 'node3'] execute cmd: fence_virt --ip=127.0.0.1 --plug=102
+info 160 fence_virt: execute power node3 off
+info 160 node3/crm: killed by poweroff
+info 160 node3/lrm: killed by poweroff
+noti 160 node1/crm: fencing of node 'node3' succeeded, trying to get its agent lock
+info 160 node1/crm: got lock 'ha_agent_node3_lock'
+info 160 node1/crm: fencing: acknowledged - got agent lock for node 'node3'
+info 160 node1/crm: node 'node3': state changed from 'fence' => 'unknown'
+emai 160 node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node3'
+info 160 node1/crm: recover service 'vm:102' from fenced node 'node3' to node 'node1'
+info 160 node1/crm: service 'vm:102': state changed from 'fence' to 'started' (node = node1)
+info 160 node1/crm: recover service 'vm:103' from fenced node 'node3' to node 'node2'
+info 160 node1/crm: service 'vm:103': state changed from 'fence' to 'started' (node = node2)
+info 160 node1/crm: recover service 'vm:104' from fenced node 'node3' to node 'node5'
+info 160 node1/crm: service 'vm:104': state changed from 'fence' to 'started' (node = node5)
+info 161 node1/lrm: got lock 'ha_agent_node1_lock'
+info 161 node1/lrm: status change wait_for_agent_lock => active
+info 161 node1/lrm: starting service vm:102
+info 161 node1/lrm: service status vm:102 started
+info 163 node2/lrm: got lock 'ha_agent_node2_lock'
+info 163 node2/lrm: status change wait_for_agent_lock => active
+info 163 node2/lrm: starting service vm:103
+info 163 node2/lrm: service status vm:103 started
+info 165 node5/lrm: got lock 'ha_agent_node5_lock'
+info 165 node5/lrm: status change wait_for_agent_lock => active
+info 165 node5/lrm: starting service vm:104
+info 165 node5/lrm: service status vm:104 started
+noti 180 node1/crm: fencing of node 'node4' succeeded, trying to get its agent lock
+info 180 node1/crm: got lock 'ha_agent_node4_lock'
+info 180 node1/crm: fencing: acknowledged - got agent lock for node 'node4'
+info 180 node1/crm: node 'node4': state changed from 'fence' => 'unknown'
+emai 180 node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node4'
+info 180 node1/crm: recover service 'vm:101' from fenced node 'node4' to node 'node1'
+info 180 node1/crm: service 'vm:101': state changed from 'fence' to 'started' (node = node1)
+info 180 node1/crm: recover service 'vm:105' from fenced node 'node4' to node 'node2'
+info 180 node1/crm: service 'vm:105': state changed from 'fence' to 'started' (node = node2)
+info 181 node1/lrm: starting service vm:101
+info 181 node1/lrm: service status vm:101 started
+info 183 node2/lrm: starting service vm:105
+info 183 node2/lrm: service status vm:105 started
+info 720 hardware: exit simulation - done
diff --git a/src/test/test-hw-fence2/manager_status b/src/test/test-hw-fence2/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-hw-fence2/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-hw-fence2/service_config b/src/test/test-hw-fence2/service_config
new file mode 100644
index 0000000..735e219
--- /dev/null
+++ b/src/test/test-hw-fence2/service_config
@@ -0,0 +1,7 @@
+{
+    "vm:101": { "node": "node4", "state": "enabled" },
+    "vm:102": { "node": "node3", "state": "enabled" },
+    "vm:103": { "node": "node3", "state": "enabled" },
+    "vm:104": { "node": "node3", "state": "enabled" },
+    "vm:105": { "node": "node4", "state": "enabled" }
+}
diff --git a/src/test/test-hw-fence3/README b/src/test/test-hw-fence3/README
new file mode 100644
index 0000000..206ee5f
--- /dev/null
+++ b/src/test/test-hw-fence3/README
@@ -0,0 +1,5 @@
+Test failover after single node network failure with parallel HW fence devices.
+As the simulated environment is limited to a single power plug, you will see
+"killed by poweroff" only once; more important is that all three fence agents
+get executed and that the node is fenced only after all three have successfully
+finished.
diff --git a/src/test/test-hw-fence3/cmdlist b/src/test/test-hw-fence3/cmdlist
new file mode 100644
index 0000000..eee0e40
--- /dev/null
+++ b/src/test/test-hw-fence3/cmdlist
@@ -0,0 +1,4 @@
+[
+ [ "power node1 on", "power node2 on", "power node3 on"],
+ [ "network node3 off" ]
+]
diff --git a/src/test/test-hw-fence3/fence.cfg b/src/test/test-hw-fence3/fence.cfg
new file mode 100644
index 0000000..60645df
--- /dev/null
+++ b/src/test/test-hw-fence3/fence.cfg
@@ -0,0 +1,17 @@
+# see man dlm.conf
+device virt:1 fence_virt ip="127.0.0.1"
+device virt:2 fence_virt ip="127.0.0.2"
+device virt:3 fence_virt ip="127.0.0.3"
+
+connect virt:1 node=node1 plug=100
+connect virt:2 node=node1 plug=100
+connect virt:3 node=node1 plug=100
+
+connect virt:1 node=node2 plug=101
+connect virt:2 node=node2 plug=101
+connect virt:3 node=node2 plug=101
+
+connect virt:1 node=node3 plug=102
+connect virt:2 node=node3 plug=102
+connect virt:3 node=node3 plug=102
+
diff --git a/src/test/test-hw-fence3/hardware_status b/src/test/test-hw-fence3/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-hw-fence3/hardware_status
@@ -0,0 +1,5 @@
+{
+ "node1": { "power": "off", "network": "off" },
+ "node2": { "power": "off", "network": "off" },
+ "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-hw-fence3/log.expect b/src/test/test-hw-fence3/log.expect
new file mode 100644
index 0000000..bb0760b
--- /dev/null
+++ b/src/test/test-hw-fence3/log.expect
@@ -0,0 +1,57 @@
+info 0 hardware: starting simulation
+info 20 cmdlist: execute power node1 on
+info 20 node1/crm: status change startup => wait_for_quorum
+info 20 node1/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node2 on
+info 20 node2/crm: status change startup => wait_for_quorum
+info 20 node2/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node3 on
+info 20 node3/crm: status change startup => wait_for_quorum
+info 20 node3/lrm: status change startup => wait_for_agent_lock
+info 20 node1/crm: got lock 'ha_manager_lock'
+info 20 node1/crm: status change wait_for_quorum => master
+info 20 node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info 20 node1/crm: adding new service 'vm:101' on node 'node1'
+info 20 node1/crm: adding new service 'vm:102' on node 'node2'
+info 20 node1/crm: adding new service 'vm:103' on node 'node3'
+info 21 node1/lrm: got lock 'ha_agent_node1_lock'
+info 21 node1/lrm: status change wait_for_agent_lock => active
+info 21 node1/lrm: starting service vm:101
+info 21 node1/lrm: service status vm:101 started
+info 22 node2/crm: status change wait_for_quorum => slave
+info 23 node2/lrm: got lock 'ha_agent_node2_lock'
+info 23 node2/lrm: status change wait_for_agent_lock => active
+info 24 node3/crm: status change wait_for_quorum => slave
+info 25 node3/lrm: got lock 'ha_agent_node3_lock'
+info 25 node3/lrm: status change wait_for_agent_lock => active
+info 25 node3/lrm: starting service vm:103
+info 25 node3/lrm: service status vm:103 started
+info 40 node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info 120 cmdlist: execute network node3 off
+info 120 node1/crm: node 'node3': state changed from 'online' => 'unknown'
+info 124 node3/crm: status change slave => wait_for_quorum
+info 125 node3/lrm: status change active => lost_agent_lock
+info 160 node1/crm: service 'vm:103': state changed from 'started' to 'fence'
+info 160 node1/crm: node 'node3': state changed from 'unknown' => 'fence'
+emai 160 node1/crm: FENCE: Try to fence node 'node3'
+noti 160 node1/crm: Start fencing node 'node3'
+noti 160 node1/crm: [fence 'node3'] execute cmd: fence_virt --ip=127.0.0.1 --plug=102
+info 160 fence_virt: execute power node3 off
+info 160 node3/crm: killed by poweroff
+info 160 node3/lrm: killed by poweroff
+noti 160 node1/crm: [fence 'node3'] execute cmd: fence_virt --ip=127.0.0.2 --plug=102
+info 160 fence_virt: execute power node3 off
+noti 160 node1/crm: [fence 'node3'] execute cmd: fence_virt --ip=127.0.0.3 --plug=102
+info 160 fence_virt: execute power node3 off
+noti 180 node1/crm: fencing of node 'node3' succeeded, trying to get its agent lock
+info 180 node1/crm: got lock 'ha_agent_node3_lock'
+info 180 node1/crm: fencing: acknowledged - got agent lock for node 'node3'
+info 180 node1/crm: node 'node3': state changed from 'fence' => 'unknown'
+emai 180 node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node3'
+info 180 node1/crm: recover service 'vm:103' from fenced node 'node3' to node 'node2'
+info 180 node1/crm: service 'vm:103': state changed from 'fence' to 'started' (node = node2)
+info 183 node2/lrm: starting service vm:103
+info 183 node2/lrm: service status vm:103 started
+info 720 hardware: exit simulation - done
diff --git a/src/test/test-hw-fence3/manager_status b/src/test/test-hw-fence3/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-hw-fence3/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-hw-fence3/service_config b/src/test/test-hw-fence3/service_config
new file mode 100644
index 0000000..70f11d6
--- /dev/null
+++ b/src/test/test-hw-fence3/service_config
@@ -0,0 +1,5 @@
+{
+ "vm:101": { "node": "node1", "state": "enabled" },
+ "vm:102": { "node": "node2" },
+ "vm:103": { "node": "node3", "state": "enabled" }
+}
--
2.20.1