[PVE-User] Enabling telemetry broke all my ceph managers

Lindsay Mathieson lindsay.mathieson at gmail.com
Thu Jun 18 13:30:38 CEST 2020


Clean Nautilus install I set up last week

  * 5 Proxmox nodes
      o All on latest updates via no-subscription channel
  * 18 OSDs
  * 3 Managers
  * 3 Monitors
  * Cluster health good
  * In a protracted rebalance phase
  * All managed via Proxmox

I thought I would enable telemetry for Ceph as per this article:

https://docs.ceph.com/docs/master/mgr/telemetry/


  * Enabled the module (command line)
  * ceph telemetry on
  * Tested getting the status
  * Set the contact and description:
    ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
    ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
    ceph config set mgr mgr/telemetry/channel_ident true
  * Tried sending it:
    ceph telemetry send
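
The first and third bullets refer to commands I didn't paste; going by 
the docs linked above they would have been something like this 
(reconstructed from the documentation, not from my shell history):

    # enable the telemetry mgr module (the "Enabled the module" step)
    ceph mgr module enable telemetry
    # check the module's opt-in state (the "Tested getting the status" step)
    ceph telemetry status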

I *think* this is when the managers died, but it could have been 
earlier. Around then all Ceph IO stopped and I discovered all three 
managers had crashed and would not restart. I was shitting myself 
because this was remote and the router is a pfSense VM :) Fortunately it 
kept going without its disk responding.

systemctl start ceph-mgr@vni.service
Job for ceph-mgr@vni.service failed because the control process exited 
with error code.
See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for 
details.

From journalctl -xe

    -- The unit ceph-mgr@vni.service has entered the 'failed' state with
    result 'exit-code'.
    Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager
    daemon.
    -- Subject: A start job for unit ceph-mgr@vni.service has failed
    -- Defined-By: systemd
    -- Support: https://www.debian.org/support
    --
    -- A start job for unit ceph-mgr@vni.service has finished with a
    failure.
    --
    -- The job identifier is 91690 and the job result is failed.


From systemctl status ceph-mgr@vni.service

ceph-mgr@vni.service - Ceph cluster manager daemon
    Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; 
vendor preset: enabled)
   Drop-In: /lib/systemd/system/ceph-mgr@.service.d
            └─ceph-after-pve-cluster.conf
    Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 
AEST; 8min ago
   Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} 
--id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Main PID: 415566 (code=exited, status=1/FAILURE)

Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service 
RestartSec=10s expired, scheduling restart.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart 
job, restart counter is at 4.
Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request 
repeated too quickly.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 
'exit-code'.
Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
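
The systemd output only shows the restart loop, not why the mgr died; 
the actual traceback should be in the mgr's own log (standard location, 
assuming default logging), e.g.:

    # look for the python traceback from the crashed mgr
    tail -n 200 /var/log/ceph/ceph-mgr.vni.log
    # or pull just this unit's journal instead of journalctl -xe
    journalctl -u ceph-mgr@vni.service -n 100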

I created a new manager service on an unused node and fortunately that 
worked. I deleted/recreated the old managers and they started working. 
It was a sweaty few minutes :)
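
For anyone in the same spot: the create/destroy can be done per node 
with pveceph (syntax from memory, so treat this as a sketch):

    # on the spare node: bring up a fresh manager
    pveceph mgr create
    # on each node whose manager is dead: remove and recreate it
    # (<nodename> is a placeholder for the mgr ID, which Proxmox names after the node)
    pveceph mgr destroy <nodename>
    pveceph mgr create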


Everything resumed without a hiccup after that, which impressed me. Not 
game to try and reproduce it though.



-- 
Lindsay



