[PVE-User] pve-user Digest, Vol 147, Issue 10
Oleksii Tokovenko
atokovenko at gmail.com
Mon Jun 22 21:57:31 CEST 2020
unsubscribe
On Fri, Jun 19, 2020 at 13:00, <pve-user-request at pve.proxmox.com> wrote:
> Send pve-user mailing list submissions to
> pve-user at pve.proxmox.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> or, via email, send a message with subject or body 'help' to
> pve-user-request at pve.proxmox.com
>
> You can reach the person managing the list at
> pve-user-owner at pve.proxmox.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of pve-user digest..."
>
>
> Today's Topics:
>
> 1. Enabling telemetry broke all my ceph managers (Lindsay Mathieson)
> 2. Re: Enabling telemetry broke all my ceph managers (Brian :)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 18 Jun 2020 21:30:38 +1000
> From: Lindsay Mathieson <lindsay.mathieson at gmail.com>
> To: PVE User List <pve-user at pve.proxmox.com>
> Subject: [PVE-User] Enabling telemetry broke all my ceph managers
> Message-ID: <a6481a31-5d59-c13c-dea2-5367842c21e7 at gmail.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Clean Nautilus install I set up last week:
>
> * 5 Proxmox nodes
> o All on latest updates via no-subscription channel
> * 18 OSDs
> * 3 Managers
> * 3 Monitors
> * Cluster health good
> * In a protracted rebalance phase
> * All managed via Proxmox
>
> I thought I would enable telemetry for Ceph as per this article:
>
> https://docs.ceph.com/docs/master/mgr/telemetry/
>
>
> * Enabled the module (command line)
> * ceph telemetry on
> * Tested getting the status
> * Set the contact and description
>   ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>'
>   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
>   ceph config set mgr mgr/telemetry/channel_ident true
> * Tried sending it
>   ceph telemetry send
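>
> For anyone following along, the full sequence from that doc page looks
> roughly like this (a sketch based on the linked article, not an exact
> transcript of my session):
>
>   ceph mgr module enable telemetry   # load the telemetry mgr module
>   ceph telemetry show                # preview the report before opting in
>   ceph telemetry on                  # opt in to periodic reporting
>   ceph telemetry send                # trigger an immediate report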
>
> I *think* this is when the managers died, but it could have been
> earlier. Around then, all Ceph IO stopped and I discovered all
> three managers had crashed and would not restart. I was shitting myself
> because this was remote and the router is a pfSense VM :) Fortunately it
> kept running even with its disk not responding.
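>
> A quick way to confirm that state, for what it's worth: "ceph -s" shows
> a line like "mgr: no daemons active" when every manager is down
> (illustrative check, not a capture from my session).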
>
> systemctl start ceph-mgr@vni.service
> Job for ceph-mgr@vni.service failed because the control process exited
> with error code.
> See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for
> details.
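>
> (For a view limited to just this unit, "journalctl -u
> ceph-mgr@vni.service -b" filters the journal accordingly; a suggested
> alternative rather than what I actually ran.)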
>
> From journalctl -xe:
>
> -- The unit ceph-mgr@vni.service has entered the 'failed' state with
> result 'exit-code'.
> Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager
> daemon.
> -- Subject: A start job for unit ceph-mgr@vni.service has failed
> -- Defined-By: systemd
> -- Support: https://www.debian.org/support
> --
> -- A start job for unit ceph-mgr@vni.service has finished with a
> failure.
> --
> -- The job identifier is 91690 and the job result is failed.
>
>
> From systemctl status ceph-mgr@vni.service:
>
> ceph-mgr@vni.service - Ceph cluster manager daemon
>    Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled;
> vendor preset: enabled)
>   Drop-In: /lib/systemd/system/ceph-mgr@.service.d
>            └─ceph-after-pve-cluster.conf
>    Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52
> AEST; 8min ago
>   Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER}
> --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
>  Main PID: 415566 (code=exited, status=1/FAILURE)
>
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service
> RestartSec=10s expired, scheduling restart.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart
> job, restart counter is at 4.
> Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request
> repeated too quickly.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result
> 'exit-code'.
> Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager
> daemon.
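>
> (Side note: "Start request repeated too quickly" is just systemd's
> start-rate limiting kicking in after the repeated crashes; a generic
> "systemctl reset-failed ceph-mgr@vni.service" clears that counter so
> another start can be attempted. It would not have fixed the underlying
> crash, of course.)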
>
> I created a new manager service on an unused node and fortunately that
> worked. I deleted/recreated the old managers and they started working.
> It was a sweaty few minutes :)
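>
> For reference, the destroy/recreate step on Proxmox is roughly the
> following (pveceph's mgr subcommands; exact syntax may differ between
> PVE versions):
>
>   pveceph mgr destroy vni   # remove the crashed manager on this node
>   pveceph mgr create        # create a fresh manager on the local node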
>
>
> Everything resumed without a hiccup after that, which impressed me.
> Not game to try and reproduce it, though.
>
>
>
> --
> Lindsay
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 18 Jun 2020 23:06:40 +0100
> From: "Brian :" <brians at iptel.co>
> To: PVE User List <pve-user at pve.proxmox.com>
> Subject: Re: [PVE-User] Enabling telemetry broke all my ceph managers
> Message-ID:
> <CAGPQfi_xwebe=
> MeekoDhoLN1s30BKX9cDdiEdJVLFvvQZH733Q at mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Nice save. And thanks for the detailed info.
>
> On Thursday, June 18, 2020, Lindsay Mathieson
> <lindsay.mathieson at gmail.com> wrote:
> > [original message quoted in full; trimmed]
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>
> ------------------------------
>
> End of pve-user Digest, Vol 147, Issue 10
> *****************************************
>
--
Best regards,
Oleksii Tokovenko