[PVE-User] Enabling telemetry broke all my ceph managers
Lindsay Mathieson
lindsay.mathieson at gmail.com
Thu Jun 18 13:30:38 CEST 2020
Clean Nautilus install I set up last week
 * 5 Proxmox nodes
     o All on latest updates via the no-subscription channel
 * 18 OSDs
 * 3 Managers
 * 3 Monitors
 * Cluster health good
 * In a protracted rebalance phase
 * All managed via Proxmox
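Health and rebalance progress above are just what the usual status
commands were reporting at the time, e.g.:

      ceph -s
      ceph health detail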
I thought I would enable telemetry for Ceph as per this article:
https://docs.ceph.com/docs/master/mgr/telemetry/
 * Enabled the module (command line)
 * ceph telemetry on
 * Tested getting the status
 * Set the contact and description
       ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
       ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
       ceph config set mgr mgr/telemetry/channel_ident true
 * Tried sending it
       ceph telemetry send
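For anyone retracing this, the full sequence per that docs page looks
roughly like the below (the module-enable and preview commands here are
quoted from the docs rather than my exact shell history):

      # enable the mgr module, then opt in
      ceph mgr module enable telemetry
      ceph telemetry on
      # preview what would be reported
      ceph telemetry show
      # contact/description as above, then send
      ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
      ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
      ceph config set mgr mgr/telemetry/channel_ident true
      ceph telemetry send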
I *think* this is when the managers died, but it could have been
earlier. Around then all Ceph IO stopped and I discovered all three
managers had crashed and would not restart. I was shitting myself
because this was remote and the router is a pfSense VM :) Fortunately it
kept running even with its disk not responding.
systemctl start ceph-mgr@vni.service
Job for ceph-mgr@vni.service failed because the control process exited
with error code.
See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for
details.
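In hindsight, turning the module back off before restarting might have
been worth trying, something like the below. That's a guess on my part,
not something I tested, and with all mgrs down the first command
probably wouldn't respond anyway:

      ceph telemetry off
      ceph mgr module disable telemetry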
From journalctl -xe
-- The unit ceph-mgr@vni.service has entered the 'failed' state with
result 'exit-code'.
Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager
daemon.
-- Subject: A start job for unit ceph-mgr@vni.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit ceph-mgr at vni.service has finished with a
failure.
--
-- The job identifier is 91690 and the job result is failed.
From systemctl status ceph-mgr@vni.service
ceph-mgr@vni.service - Ceph cluster manager daemon
    Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled;
vendor preset: enabled)
   Drop-In: /lib/systemd/system/ceph-mgr@.service.d
            └─ceph-after-pve-cluster.conf
    Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52
AEST; 8min ago
   Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER}
--id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Main PID: 415566 (code=exited, status=1/FAILURE)
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service
RestartSec=10s expired, scheduling restart.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart
job, restart counter is at 4.
Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request
repeated too quickly.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result
'exit-code'.
Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
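The systemd output doesn't say why the daemon keeps dying; the actual
traceback should be in the mgr log on that node, e.g.:

      journalctl -u ceph-mgr@vni.service -n 100
      less /var/log/ceph/ceph-mgr.vni.log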
I created a new manager service on an unused node and fortunately that
worked. I deleted/recreated the old managers and they started working.
It was a sweaty few minutes :)
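Since everything here is managed via Proxmox, the delete/recreate
amounts to something like the below (or the equivalent buttons in the
GUI); command names are from the pveceph tool, so double-check them
against your PVE version:

      # create a fresh mgr on an unused node
      pveceph mgr create
      # then destroy and recreate each broken one, e.g. on vni
      pveceph mgr destroy vni
      pveceph mgr create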
Everything resumed without a hiccup after that, which impressed me. Not
game to try and reproduce it, though.
--
Lindsay