[PVE-User] Enabling telemetry broke all my ceph managers

Fri Jun 19 00:06:40 CEST 2020

Nice save. And thanks for the detailed info.

On Thursday, June 18, 2020, Lindsay Mathieson <lindsay.mathieson at gmail.com>
wrote:
> Clean Nautilus install I set up last week
>
>  * 5 Proxmox nodes
>      o All on latest updates via no-subscription channel
>  * 18 OSDs
>  * 3 Managers
>  * 3 Monitors
>  * Cluster health good
>  * In a protracted rebalance phase
>  * All managed via proxmox
>
> I thought I would enable telemetry for Ceph as per this article:
>
> https://docs.ceph.com/docs/master/mgr/telemetry/
>
>
>  * Enabled the module (command line)
>  * ceph telemetry on
>  * Tested getting the status
>  * Set the contact and description
>    ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
>    ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
>    ceph config set mgr mgr/telemetry/channel_ident true
>  * Tried sending it
>    ceph telemetry send
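
The steps quoted above can be sketched as a small script. This is only a rough outline, not a tested procedure: the `run` and `telemetry_steps` helpers are hypothetical names, and the `ceph telemetry show` preview step is an addition that was not part of the original sequence. The script only prints the commands unless RUN=1 is set, so nothing touches the cluster by accident. The contact and description settings would go between the status check and the send, exactly as quoted above.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set RUN=1 to run against a real cluster (assumes the ceph CLI is in PATH).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

telemetry_steps() {
    run ceph mgr module enable telemetry
    run ceph telemetry show      # preview the report before anything is sent
    run ceph telemetry on
    run ceph telemetry status
    run ceph telemetry send
}

telemetry_steps
```

Depending on the exact Nautilus point release, `ceph telemetry on` may ask for an additional license argument; the error message states what it expects.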
>
> I *think* this is when the managers died, but it could have been earlier.
> Around then all Ceph I/O stopped, and I discovered that all three managers
> had crashed and would not restart. I was shitting myself because this was
> remote and the router is a pfSense VM :) Fortunately it kept going without
> its disk responding.
>
> systemctl start ceph-mgr@vni.service
> Job for ceph-mgr@vni.service failed because the control process exited
> with error code.
> See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for
> details.
>
> From journalctl -xe
>
>    -- The unit ceph-mgr@vni.service has entered the 'failed' state with
>    result 'exit-code'.
>    Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager
>    daemon.
>    -- Subject: A start job for unit ceph-mgr@vni.service has failed
>    -- Defined-By: systemd
>    -- Support: https://www.debian.org/support
>    --
>    -- A start job for unit ceph-mgr@vni.service has finished with a
>    failure.
>    --
>    -- The job identifier is 91690 and the job result is failed.
>
>
> From systemctl status ceph-mgr@vni.service
>
> ceph-mgr@vni.service - Ceph cluster manager daemon
>    Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
>   Drop-In: /lib/systemd/system/ceph-mgr@.service.d
>            └─ceph-after-pve-cluster.conf
>    Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST; 8min ago
>   Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
>  Main PID: 415566 (code=exited, status=1/FAILURE)
>
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service RestartSec=10s expired, scheduling restart.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart job, restart counter is at 4.
> Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request repeated too quickly.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 'exit-code'.
> Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
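
The systemd output above only says that the unit exited with status 1 and then hit its start limit ("Start request repeated too quickly"); it does not show why ceph-mgr itself exited. A minimal sketch of how one might dig further, assuming the same unit naming as in the quoted output: read the unit's own journal, clear the failed state so systemd will accept a new start request, then try again. The `run` and `mgr_diagnose` helpers are hypothetical, and the script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set RUN=1 to execute on the affected node.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

mgr_diagnose() {
    # $1 = manager ID ("vni" in the quoted output)
    unit="ceph-mgr@$1.service"
    run journalctl -u "$unit" --no-pager -n 50   # the daemon's own last messages
    run systemctl reset-failed "$unit"           # clear the start-limit failure state
    run systemctl start "$unit"
}

mgr_diagnose vni
```

The manager's log file under /var/log/ceph/ is another place where the actual error may show up.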
>
> I created a new manager service on an unused node and fortunately that
> worked. I deleted/recreated the old managers and they started working. It
> was a sweaty few minutes :)
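
For anyone who lands in the same spot: the message does not say whether the managers were recreated through the GUI or the command line, so the following is only a guess at the CLI equivalent using `pveceph`. The `run` and `recreate_mgr` helpers are hypothetical, and the script only prints the commands unless RUN=1 is set. It assumes it is run on the node that hosts the manager being recreated, after a working manager already exists elsewhere.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set RUN=1 to execute on the node whose manager should be recreated.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

recreate_mgr() {
    # $1 = ID of the broken manager; on Proxmox this is normally the node name
    run pveceph mgr destroy "$1"
    run pveceph mgr create       # creates a manager on the local node
}

recreate_mgr vni
```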
>
>
> Everything resumed without a hiccup after that; I'm impressed. Not game to
> try and reproduce it though.
>
>
>
> --
> Lindsay