[PVE-User] Enabling telemetry broke all my ceph managers
Brian :
brians at iptel.co
Fri Jun 19 00:06:40 CEST 2020
Nice save. And thanks for the detailed info.
On Thursday, June 18, 2020, Lindsay Mathieson <lindsay.mathieson at gmail.com>
wrote:
> Clean Nautilus install I set up last week
>
> * 5 Proxmox nodes
> o All on latest updates via no-subscription channel
> * 18 OSDs
> * 3 Managers
> * 3 Monitors
> * Cluster health good
> * In a protracted rebalance phase
> * All managed via proxmox
>
> I thought I would enable telemetry for Ceph as per this article:
>
> https://docs.ceph.com/docs/master/mgr/telemetry/
>
>
> * Enabled the module (command line)
> * ceph telemetry on
> * Tested getting the status
> * Set the contact and description
> ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
> ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
> ceph config set mgr mgr/telemetry/channel_ident true
> * Tried sending it
> ceph telemetry send
>
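For reference, the full sequence consolidated into one place (a sketch following the linked docs; the contact/description values are the placeholders from that page, and the module-enable command is my reading of the docs rather than something quoted above):

```shell
# Enable the telemetry mgr module, then turn reporting on
ceph mgr module enable telemetry
ceph telemetry on

# Optionally inspect the module state and the report it would send
ceph telemetry status
ceph telemetry show

# Set contact/description and opt in to the ident channel
ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
ceph config set mgr mgr/telemetry/channel_ident true

# Manually trigger a report
ceph telemetry send
```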
> I *think* this is when the managers died, but it could have been earlier.
> Around then all Ceph IO stopped and I discovered all three managers
> had crashed and would not restart. I was shitting myself because this was
> remote and the router is a pfSense VM :) Fortunately it kept going without
> its disk responding.
>
> systemctl start ceph-mgr@vni.service
> Job for ceph-mgr@vni.service failed because the control process exited
> with error code.
> See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for
> details.
>
> From journalctl -xe
>
> -- The unit ceph-mgr@vni.service has entered the 'failed' state with
> result 'exit-code'.
> Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager
> daemon.
> -- Subject: A start job for unit ceph-mgr@vni.service has failed
> -- Defined-By: systemd
> -- Support: https://www.debian.org/support
> --
> -- A start job for unit ceph-mgr@vni.service has finished with a
> failure.
> --
> -- The job identifier is 91690 and the job result is failed.
>
>
> From systemctl status ceph-mgr@vni.service
>
> ceph-mgr@vni.service - Ceph cluster manager daemon
> Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor
> preset: enabled)
> Drop-In: /lib/systemd/system/ceph-mgr@.service.d
> └─ceph-after-pve-cluster.conf
> Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST;
> 8min ago
> Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER}
> --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
> Main PID: 415566 (code=exited, status=1/FAILURE)
>
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service
> RestartSec=10s expired, scheduling restart.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart
> job, restart counter is at 4.
> Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request
> repeated too quickly.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result
> 'exit-code'.
> Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager
> daemon.
>
> I created a new manager service on an unused node and fortunately that
> worked. I deleted/recreated the old managers and they started working. It
> was a sweaty few minutes :)
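>
> If anyone needs to do the same, this is roughly the equivalent with the
> pveceph CLI (a sketch from memory of the tool; the mgr ID "vni" and the
> spare-node step reflect my setup, adjust to yours):

```shell
# On a spare node: create a fresh manager so the cluster gets an
# active mgr back (the ID defaults to the node name)
pveceph mgr create

# Confirm a mgr is active again
ceph -s | grep mgr

# Then, on each node whose manager is wedged: destroy and recreate it
pveceph mgr destroy vni
pveceph mgr create
```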
>
>
> Everything resumed without a hiccup after that, which impressed me. Not
> game to try and reproduce it though.
>
>
>
> --
> Lindsay
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>