[pve-devel] [PATCH ha-manager] get_pve_lock: allow retrying if pmxcfs is offline
Thomas Lamprecht
t.lamprecht at proxmox.com
Tue Nov 15 09:34:43 CET 2016
Our cluster filesystem is offline for a short time while its package
(pve-cluster) gets updated.
If the LRM or CRM calls get_protected_ha_*_lock during such a
time, it runs into this check and dies.
As a result we assume that we lost our lock and change to the
'lost_agent_lock' state. The watchdog updates are then stopped to
allow self-fencing.
This fencing is completely unnecessary, so instead of dying, log the
error and try again. If it still isn't mounted after 5 tries
(= 5 seconds), we can assume that pmxcfs is dead and switch to the
'lost_agent_lock' state.
Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
can be triggered with an active LRM and a few
# systemctl restart pve-cluster
commands
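
For illustration, the retry behaviour described in the commit message
can be sketched roughly as below. This is a minimal standalone sketch
and not the ha-manager code: try_get_lock, lock_or_give_up and the
$log/$wait callbacks are hypothetical names, and only the mkdir / -d
check mirrors what the patch does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_pve_lock: log and return 0 (instead of
# dying) when the lock directory cannot be created, e.g. because
# pmxcfs is not mounted at the moment.
sub try_get_lock {
    my ($lockdir, $log) = @_;

    mkdir $lockdir;

    if (! -d $lockdir) {
        $log->("can't create '$lockdir' (pmxcfs not mounted?)");
        return 0;
    }

    return 1;
}

# Hypothetical caller: retry up to $max_tries times and only report
# 'lost_agent_lock' once every attempt has failed.
sub lock_or_give_up {
    my ($lockdir, $max_tries, $log, $wait) = @_;

    for my $try (1 .. $max_tries) {
        return 'locked' if try_get_lock($lockdir, $log);
        $wait->() if $try < $max_tries; # e.g. sub { sleep 1 }
    }

    return 'lost_agent_lock';
}
```

With $max_tries = 5 and a $wait callback that sleeps for one second,
a short pmxcfs restart is survived, while a filesystem that stays
unmounted still ends in 'lost_agent_lock'.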
src/PVE/HA/Env/PVE2.pm | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index 6b8802e..e099904 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -227,7 +227,10 @@ sub get_pve_lock {
     mkdir $lockdir;
 
     # pve cluster filesystem not online
-    die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;
+    if (! -d $lockdir) {
+        $self->log('err', "can't create '$lockdir' (pmxcfs not mounted?)");
+        return 0;
+    }
 
     if ($last && (($ctime - $last) < $retry_timeout)) {
         # send cfs lock update request (utime)
--
2.1.4