[pve-devel] [PATCH ha-manager v2 6/7] Manager: record tried node on relocation policy
Thomas Lamprecht
t.lamprecht at proxmox.com
Wed Jun 15 17:59:02 CEST 2016
Instead of simply incrementing an integer on each failed relocation
attempt, record the nodes that were already tried. The try count is
still available as the size of the array, so no information is lost
and there is no behavioural change.

For now, use this to log the nodes on which recovery failed. This may
help a user see which nodes fail, so that they can investigate the
cause and fix it.

Furthermore, this prepares us for a more intelligent recovery node
selection, as we can skip nodes already tried in the current
recovery cycle.

By reusing relocate_trial as relocate_tried_nodes, this works
without any overhead (i.e. an additional hash) in the manager
status.
Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
no change since last send
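
As a rough illustration of the bookkeeping described in the commit
message, here is a minimal standalone sketch. It is not the code from
Manager.pm: record_failed_node and untried_nodes are hypothetical
helpers introduced only for this example, and the node names are made
up. It assumes the failed node is appended before the limit check, as
in the patch, which is why the old "$try < max_relocate" comparison
becomes "size <= max_relocate".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PVE::HA::Manager): append the node
# that just failed and decide whether another relocation is allowed.
sub record_failed_node {
    my ($tried_nodes, $failed_node, $max_relocate) = @_;

    my @tried = (@{$tried_nodes // []}, $failed_node);

    # The array size replaces the old integer counter. The node is
    # pushed before the check, so "try < max" becomes "size <= max".
    my $try_next = scalar(@tried) <= $max_relocate ? 1 : 0;

    return (\@tried, $try_next);
}

# Hypothetical filter for a later, smarter node selection: drop every
# candidate that was already tried in the current recovery cycle.
sub untried_nodes {
    my ($candidates, $tried_nodes) = @_;

    my %tried = map { $_ => 1 } @$tried_nodes;
    return [ grep { !$tried{$_} } @$candidates ];
}

# First failure on node2 with max_relocate = 1: relocation is allowed.
my ($tried, $try_next) = record_failed_node(undef, 'node2', 1);
print "tried: @$tried, relocate: $try_next\n";

# Second failure on node1: the limit is exceeded, no further relocation.
($tried, $try_next) = record_failed_node($tried, 'node1', 1);
print "tried: @$tried, relocate: $try_next\n";

my $left = untried_nodes([qw(node1 node2 node3)], $tried);
print "untried: @$left\n";
```

The list is only kept while a recovery cycle is in progress; the patch
deletes the entry for the service again once a start succeeds.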
src/PVE/HA/Manager.pm | 27 +++++++++++++++++----------
src/test/test-resource-failure2/log.expect | 1 +
src/test/test-resource-failure5/log.expect | 2 +-
3 files changed, 19 insertions(+), 11 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 1208720..6e30c39 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -349,8 +349,8 @@ sub manage {
}
# remove stale relocation try entries
- foreach my $sid (keys %{$ms->{relocate_trial}}) {
- delete $ms->{relocate_trial}->{$sid} if !$ss->{$sid};
+ foreach my $sid (keys %{$ms->{relocate_tried_nodes}}) {
+ delete $ms->{relocate_tried_nodes}->{$sid} if !$ss->{$sid};
}
$self->update_crm_commands();
@@ -589,31 +589,38 @@ sub next_state_started {
} else {
my $try_next = 0;
+ my $tried_nodes = $master_status->{relocate_tried_nodes}->{$sid} || [];
if ($lrm_res) {
+ # add current service node to failed list
+ push @$tried_nodes, $sd->{node};
+
my $ec = $lrm_res->{exit_code};
if ($ec == SUCCESS) {
- $master_status->{relocate_trial}->{$sid} = 0;
+ if (scalar(@$tried_nodes) > 1) {
+ $haenv->log('info', "relocation policy successful for '$sid'," .
+ " tried nodes: " . join(', ', @$tried_nodes) );
+ }
+
+ delete $master_status->{relocate_tried_nodes}->{$sid};
} elsif ($ec == ERROR) {
# apply our relocate policy if we got ERROR from the LRM
- my $try = $master_status->{relocate_trial}->{$sid} || 0;
-
- if ($try < $cd->{max_relocate}) {
+ if (scalar(@$tried_nodes) <= $cd->{max_relocate}) {
- $try++;
# tell select_service_node to relocate if possible
$try_next = 1;
+ $master_status->{relocate_tried_nodes}->{$sid} = $tried_nodes;
$haenv->log('warning', "starting service $sid on node".
" '$sd->{node}' failed, relocating service.");
- $master_status->{relocate_trial}->{$sid} = $try;
} else {
- $haenv->log('err', "recovery policy for service".
- " $sid failed, entering error state!");
+ $haenv->log('err', "recovery policy for service $sid " .
+ "failed, entering error state. Tried nodes: ".
+ join(', ', @$tried_nodes));
&$change_service_state($self, $sid, 'error');
return;
diff --git a/src/test/test-resource-failure2/log.expect b/src/test/test-resource-failure2/log.expect
index 604ad95..aa34e35 100644
--- a/src/test/test-resource-failure2/log.expect
+++ b/src/test/test-resource-failure2/log.expect
@@ -41,4 +41,5 @@ info 201 node1/lrm: got lock 'ha_agent_node1_lock'
info 201 node1/lrm: status change wait_for_agent_lock => active
info 201 node1/lrm: starting service fa:130
info 201 node1/lrm: service status fa:130 started
+info 220 node1/crm: relocation policy successful for 'fa:130', tried nodes: node2, node1
info 720 hardware: exit simulation - done
diff --git a/src/test/test-resource-failure5/log.expect b/src/test/test-resource-failure5/log.expect
index eb87f9f..a15603e 100644
--- a/src/test/test-resource-failure5/log.expect
+++ b/src/test/test-resource-failure5/log.expect
@@ -28,7 +28,7 @@ warn 123 node2/lrm: restart policy: retry number 1 for service 'fa:130'
info 143 node2/lrm: starting service fa:130
warn 143 node2/lrm: unable to start service fa:130
err 143 node2/lrm: unable to start service fa:130 on local node after 1 retries
-err 160 node1/crm: recovery policy for service fa:130 failed, entering error state!
+err 160 node1/crm: recovery policy for service fa:130 failed, entering error state. Tried nodes: node2
info 160 node1/crm: service 'fa:130': state changed from 'started' to 'error'
err 163 node2/lrm: service fa:130 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
info 220 cmdlist: execute service fa:130 disabled
--
2.1.4