[PVE-User] Ceph: Monitors not running but cannot be destroyed or recreated
Frank Thommen
f.thommen at dkfz-heidelberg.de
Sun Jan 26 23:51:54 CET 2020
On 26/01/2020 16:46, Frank Thommen wrote:
> On 26/01/2020 14:14, Frank Thommen wrote:
>> Dear all,
>>
>> I am trying to destroy "old" Ceph monitors, but they can neither be
>> destroyed nor recreated:
>>
>> I am currently configuring Ceph on our PVE cluster (3 nodes running
>> PVE 6.1-3). There were some remnants of a previous Ceph configuration
>> which I had set up while the nodes were not yet joined in a cluster
>> (and with the wrong network). However, I had purged that configuration
>> with `pveceph purge`. I have redone the basic Ceph configuration
>> through the GUI on the first node and deleted the still existing
>> managers through the GUI (for a fresh start).
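>>
>> For the record, I believe the command-line equivalent of what I did
>> through the GUI would be roughly the following (the network below is
>> just a placeholder for our actual Ceph network):
>>
>> pveceph purge                          # wipe the old, broken configuration
>> pveceph init --network 10.10.10.0/24   # write a fresh ceph.conf
>> pveceph mon create                     # create the monitor on the first node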
>>
>> A new monitor has been created on the first node automatically, but I
>> am unable to delete the monitors on nodes 2 and 3. They show up as
>> Status=stopped and Address=Unknown in the GUI and they cannot be
>> started (no error message). In the syslog window I see (after
>> rebooting node odcf-pve02):
>>
>> ------------
>> Jan 26 13:51:53 odcf-pve02 systemd[1]: Started Ceph cluster monitor
>> daemon.
>> Jan 26 13:51:55 odcf-pve02 ceph-mon[1372]: 2020-01-26 13:51:55.450
>> 7faa98ab9280 -1 mon.odcf-pve02 at 0(electing) e1 failed to get devid for
>> : fallback method has serial ''but no model
>> ------------
>>
>> On the other hand, I see the same message on the first node, and there
>> the monitor seems to work fine.
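>>
>> To cross-check which monitors the cluster itself still knows about
>> (assuming it still answers at all), I suppose one can compare the
>> monmap with the local systemd unit:
>>
>> ceph mon dump                          # monitors listed in the monmap
>> systemctl status ceph-mon@odcf-pve02   # state of the local mon service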
>>
>> Trying to destroy them results in the message that there is no such
>> monitor, and trying to create a new monitor on these nodes results in
>> the message that the monitor already exists... I am stuck in this
>> existence loop. Destroying or creating them doesn't work on the
>> command line either.
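>>
>> For completeness, the command-line attempts were essentially the
>> following (syntax from memory):
>>
>> pveceph mon destroy odcf-pve02   # rejected with "no such monitor"
>> pveceph mon create               # rejected because the monitor "already exists"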
>>
>> Any idea on how to fix this? I'd rather not completely reinstall the
>> nodes :-)
>>
>> Cheers
>> frank
>
>
> In an attempt to clean up the Ceph setup again, I ran
>
> pveceph stop ceph.target
> pveceph purge
>
> on the first node. Now I get the error
>
> rados_connect failed - No such file or directory (500)
>
> when I select Ceph in the GUI of any of the three nodes. A reboot of
> all nodes didn't help.
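>
> If I understand it correctly, rados_connect failing with "No such file
> or directory" just means that the client cannot find a ceph.conf any
> more; on PVE, /etc/ceph/ceph.conf is normally a symlink to
> /etc/pve/ceph.conf, so a quick sanity check would be:
>
> ls -l /etc/ceph/ceph.conf /etc/pve/ceph.conf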
>
> frank
I was finally able to completely purge the old settings and reconfigure
Ceph by following the various instructions from this post:
https://forum.proxmox.com/threads/not-able-to-use-pveceph-purge-to-completely-remove-ceph.59606/
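
For the archives, the cleanup boiled down to steps of roughly this kind
(from memory; the forum post has the authoritative list, and these
commands wipe Ceph completely, so adapt them with care):

# on every node: stop the Ceph services and remove local state
systemctl stop ceph-mon.target ceph-mgr.target ceph-osd.target
rm -rf /etc/systemd/system/ceph*
rm -rf /var/lib/ceph/mon/* /var/lib/ceph/mgr/*
pveceph purge

# the shared configuration lives on /etc/pve, so remove it on one node only
rm -f /etc/pve/ceph.conf
rm -rf /etc/pve/priv/ceph

After that, Ceph could be set up again from scratch through the GUI.
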
Maybe this information could be added to the official documentation
(unless there is a nicer way of completely resetting Ceph in a PROXMOX
cluster)?
frank