[PVE-User] ceph-osd not starting after network related issues
proxmox at iancoetzee.za.net
Wed Jul 3 08:35:01 CEST 2019
Some feedback from my end. I managed to recover the "lost data" from one of
the other OSDs. It seems my initial summary was a bit off: the PGs were
replicated, and Ceph just wanted to confirm that the objects were
For future reference, I basically marked the OSD as lost:
> ceph osd lost <id>
Then the PGs went into an incomplete state.
After that I temporarily set an option on the OSDs to ignore the history
(osd_find_best_info_ignore_history_les). Got the info from
After that Ceph was happy and started to rebalance the cluster, phew,
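
The sequence above can be sketched roughly as follows. This is only a sketch under assumptions, not the exact commands used: the OSD id 7 is made up, and setting the option at runtime through injectargs is a guess (the post does not say how the option was set; putting it in ceph.conf and restarting the OSDs is another route, and the PGs may need to re-peer before it has any effect). The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them against a real cluster.
DRY_RUN="${DRY_RUN:-1}"
OSD_ID="${OSD_ID:-7}"   # hypothetical id of the failed OSD

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

mark_osd_lost() {
    # Tell the cluster the OSD's data is gone for good.
    run ceph osd lost "$OSD_ID" --yes-i-really-mean-it
    # Look at which PGs are now reported as incomplete.
    run ceph health detail
}

set_ignore_history_les() {
    # $1 is true or false. Assumption: the option is injected at runtime.
    run ceph tell 'osd.*' injectargs "--osd_find_best_info_ignore_history_les=$1"
}

mark_osd_lost
set_ignore_history_les true
# Once the cluster is healthy again, revert with:
#   set_ignore_history_les false
```

Since the option was only set temporarily, the last comment matters: it should go back to false once the PGs have peered and recovery is done.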
This failure did, however, convince me to increase our replica count from
2:1 to 3:2 (size:min_size), sacrificing usable space for reliability.
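
As an illustration of that change, here is a small sketch assuming 2:1 and 3:2 refer to the pool's size:min_size. The pool name "rbd" and the 12 TB raw capacity are made-up values, and the capacity figure is a plain raw/size division that ignores overhead and full ratios. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
POOL="${POOL:-rbd}"     # hypothetical pool name

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

set_replication() {
    # $1 = size (number of replicas), $2 = min_size (replicas needed for I/O).
    # size is raised first so min_size never exceeds it.
    run ceph osd pool set "$POOL" size "$1"
    run ceph osd pool set "$POOL" min_size "$2"
}

usable_tb() {
    # Rough usable capacity: raw TB divided by replica count (integer math).
    echo $(( $1 / $2 ))
}

set_replication 3 2
echo "12 TB raw at size=2: $(usable_tb 12 2) TB usable"
echo "12 TB raw at size=3: $(usable_tb 12 3) TB usable"
```

With these made-up numbers the usable space drops from 6 TB to 4 TB, which is the trade-off mentioned above.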
Now I need to give feedback on what happened. This is the part I am still
not sure about, as SMART does not show any sector errors. I might as well
start a badblocks run and see if it detects anything.
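
A possible way to run those checks, as a sketch only: /dev/sdX is a placeholder for the disk behind the failed OSD, and the script just prints the commands unless DRY_RUN=0 is set. badblocks is used in its default read-only mode here; the -w write test is destructive and is deliberately left out.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
DISK="${DISK:-/dev/sdX}"   # placeholder device name

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_disk() {
    # Full SMART report: attributes, error log, self-test log.
    run smartctl -a "$DISK"
    # Start a long self-test; results show up later in the SMART report.
    run smartctl -t long "$DISK"
    # Read-only surface scan with progress (-s) and verbose output (-v).
    run badblocks -sv "$DISK"
}

check_disk
```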
As always, I am open to other suggestions as to where to look for clues
on what went wrong.
On Mon, 1 Jul 2019 at 09:10, Ian Coetzee <proxmox at iancoetzee.za.net> wrote:
> Hi All,
> This morning I have a bit of a big boo-boo on our production system.
> After a very sudden network outage somewhere during the night, one of my
> ceph-osd's is no longer starting up.
> If I try and start it manually, I get a very spectacular failure, see link.
> As near as I can tell, it seems to be asserting whether a file exists; I
> have yet to determine which file that would be. Any pointers are welcome,
> as well as any other ideas to get the OSD back. For some reason there is
> data on the OSD that was not replicated to my other OSDs, so I cannot
> just re-init this OSD as some of the posts I could find suggest.
> I am also going to head to the ceph ML in a bit (after I have registered)
> Kind regards