[PVE-User] Ceph: Monitors not running but cannot be destroyed or recreated
Frank Thommen
f.thommen at dkfz-heidelberg.de
Sun Jan 26 16:46:08 CET 2020
On 26/01/2020 14:14, Frank Thommen wrote:
> Dear all,
>
> I am trying to destroy "old" Ceph monitors, but they can neither be
> deleted nor recreated:
>
> I am currently configuring Ceph on our PVE cluster (3 nodes running PVE
> 6.1-3). There were some remnants of a previous Ceph configuration
> which I had tried to set up while the nodes were not yet joined in a
> cluster (and I had used the wrong network). However, I had purged those
> configurations with `pveceph purge`. I have redone the basic Ceph
> configuration through the GUI on the first node and have deleted the
> still-existing managers through the GUI (to have a fresh start).
>
> A new monitor has been created on the first node automatically, but I am
> unable to delete the monitors on nodes 2 and 3. They show up as
> Status=stopped and Address=Unknown in the GUI and they cannot be started
> (no error message). In the syslog window I see (after rebooting node
> odcf-pve02):
>
> ------------
> Jan 26 13:51:53 odcf-pve02 systemd[1]: Started Ceph cluster monitor daemon.
> Jan 26 13:51:55 odcf-pve02 ceph-mon[1372]: 2020-01-26 13:51:55.450
> 7faa98ab9280 -1 mon.odcf-pve02 at 0(electing) e1 failed to get devid for :
> fallback method has serial ''but no model
> ------------
>
> On the other hand, I see the same message on the first node, where the
> monitor nevertheless seems to work fine.
>
> Trying to destroy them results in the message that there is no such
> monitor, while trying to create a new monitor on these nodes results in
> the message that the monitor already exists... I am stuck in this
> existence loop. Destroying or creating them doesn't work on the
> command line either.
>
> Any idea how to fix this? I'd rather not reinstall the nodes from
> scratch :-)
>
> Cheers
> frank
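For reference, the command-line attempts were along these lines (a sketch: I am assuming the monitor IDs match the node hostnames, and the block only reports what the mail above describes, guarded so it degrades gracefully on a host without pveceph):

```shell
# Sketch of the CLI attempts described above (monitor ID assumed to be
# the node hostname, e.g. odcf-pve02; adjust to your node names).
if command -v pveceph >/dev/null 2>&1; then
    HAVE_PVECEPH=yes
    pveceph mon destroy odcf-pve02   # reportedly fails: no such monitor
    pveceph mon create               # reportedly fails: already exists
else
    HAVE_PVECEPH=no
    echo "pveceph not available on this host"
fi
```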
In an attempt to clean up the Ceph setup again, I ran

    pveceph stop ceph.target
    pveceph purge
on the first node. Now I get
rados_connect failed - No such file or directory (500)
when I select Ceph in the GUI of any of the three nodes. A reboot of
all nodes didn't help.
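A quick sanity check may narrow this down (a sketch using the stock Proxmox/Ceph paths, not verified against this particular cluster): the 500 error looks like what the GUI shows when librados cannot find a ceph.conf, which `pveceph purge` removes.

```shell
# Sketch of post-purge checks (stock Proxmox/Ceph paths assumed):
CONF=/etc/pve/ceph.conf
if [ -e "$CONF" ]; then
    STATE=present
else
    # Without this file librados has nothing to connect with, which would
    # plausibly explain "rados_connect failed - No such file or directory".
    STATE=missing
fi
echo "ceph.conf is $STATE"
# Leftover per-node monitor data dirs would explain monitors that
# "already exist" but cannot be started or destroyed:
ls -d /var/lib/ceph/mon/* 2>/dev/null || echo "no monitor data dirs"
```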
frank