[pve-devel] [PATCH ha-manager 2/7] fix LRM error on corrupted manager status

Thu Jul 14 14:41:48 CEST 2016

If the manager status file gets corrupted, i.e. it contains no valid
JSON or is empty, both LRM and CRM may crash here, as decode_json then
throws an error. This in turn triggers the watchdog if it is already
open, as it won't get reset anymore.

Reproducible with:
$ > /etc/pve/ha/manager_status
or
$ echo "garbage" > /etc/pve/ha/manager_status

Put an eval around the decode_json call to catch all possible errors
and, if such an error happens, recover gracefully, i.e. just return an
empty JSON object. As the current master keeps the state in memory, it
writes it back on the next loop iteration.
If no CRM is manager at that time, the state can be rebuilt from
scratch.

While this should not happen without an external fault, we should catch
it, as we can easily recover and do not want to fence a few nodes just
because someone wanted to reset the manager status.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
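
Note (not part of the commit message): the recovery pattern can be tried
outside of the cluster with a minimal standalone sketch. It assumes the
core JSON::PP module instead of whatever JSON module Config.pm loads, and
the helper name safe_json_reader is purely illustrative, not something
that exists in ha-manager.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper mirroring the patched json_reader(); the name is
# an assumption made for this sketch only.
sub safe_json_reader {
    my ($filename, $data) = @_;

    # decode_json dies on empty or malformed input, so trap that here.
    my $res = eval { decode_json($data) };
    if (my $err = $@) {
        warn "Could not decode json from '$filename': $err";
        return {};    # fall back to an empty state
    }

    return $res;
}
```

Calling safe_json_reader('manager_status', '') or passing "garbage\n"
should emit a warning and return an empty hash reference, while valid
JSON is decoded as before.
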
 src/PVE/HA/Config.pm | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/PVE/HA/Config.pm b/src/PVE/HA/Config.pm
index 9ae8d2e..a206561 100644
--- a/src/PVE/HA/Config.pm
+++ b/src/PVE/HA/Config.pm
@@ -38,7 +38,18 @@ cfs_register_file($ha_fence_config,
 sub json_reader {
     my ($filename, $data) = @_;
 
-    return defined($data) ? decode_json($data) : {};
+    # manager can rebuild and then write the config if it got corrupted,
+    # so catch all errors and just warn about them.
+    my $res;
+    eval {
+	$res = decode_json($data)
+    };
+    if (my $err = $@) {
+	warn "Could not decode json: $err\n";
+	$res = {};
+    }
+
+    return $res;
 }
 
 sub json_writer {
-- 
2.1.4