[PVE-User] Again, Ceph: default timeout for osd?

Marco Gaiarin gaio at sv.lnf.it
Fri Dec 16 12:47:31 CET 2016

Hi, Alexandre DERUMIER!
> >>mon osd down out interval
> This is the time between when a monitor marks an OSD "down" (not
> currently serving data) and "out" (not considered *responsible* for
> data by the cluster). IO will resume once the OSD is down (assuming
> the PG has its minimum number of live replicas); it's just that data
> will be re-replicated to other nodes once an OSD is marked "out".

That seems clear to me. Let me work through an example to be sure.

If I set:
	mon osd report timeout = 15
	mon osd down out interval = 300

a) after 15 seconds, the unresponsive OSD gets marked 'down', so IO resumes;

b) 5 minutes (300 seconds) after that, the OSD gets marked 'out', and rebalancing starts.
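A minimal Python sketch of the timeline in a) and b), under a deliberately simplified model: the OSD stops responding at t = 0, the monitor marks it 'down' exactly after 'mon osd report timeout' seconds, and then 'out' after a further 'mon osd down out interval' seconds. Real Ceph also relies on peer heartbeats and other grace settings, which are ignored here, and the function name osd_state is hypothetical.

```python
# Simplified model (assumption): detection happens only via the
# report timeout; heartbeat grace and peer reports are not modelled.
def osd_state(t, report_timeout=15, down_out_interval=300):
    """State of an OSD that stopped responding at t = 0 (t in seconds)."""
    if t < report_timeout:
        return "up"    # not detected yet: IO to its PGs stalls
    if t < report_timeout + down_out_interval:
        return "down"  # IO resumes if enough replicas remain; no data moved
    return "out"       # data is re-replicated elsewhere (rebalancing)


for t in (10, 15, 314, 315):
    print(t, osd_state(t))
```

With the values above, rebalancing would start 315 seconds after the failure, not 300, because the down-out interval only starts counting once the OSD is marked down.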

I still have a doubt. If I run 'ceph osd set nodown', do I simply set the
first timeout to 'never'? Explained as above, it could be... and so it is
my own fault for having set 'nodown'...


OK, it seems that the 'noout' flag is the right thing to use; 'nodown'
should be used only in 'bouncing' (flapping) situations.

If I simply need to prevent rebalancing, setting 'noout' is enough.
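For a planned reboot, one way to make sure 'noout' is always cleared afterwards is to wrap the maintenance in a context manager. This is a sketch, assuming the ceph CLI is in PATH with admin credentials; only 'ceph osd set noout' and 'ceph osd unset noout' are real commands, while the helper name noout_window is hypothetical. The command runner is injectable so the sequence can be checked without a cluster.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def noout_window(run=subprocess.run):
    """Keep the cluster-wide 'noout' flag set while the block runs.

    OSDs that stop responding are still marked down (IO fails over),
    but they are not marked out, so no rebalancing starts. The flag is
    cleared on exit even if the maintenance code raises.
    """
    run(["ceph", "osd", "set", "noout"], check=True)
    try:
        yield
    finally:
        run(["ceph", "osd", "unset", "noout"], check=True)
```

Usage would be "with noout_window():" around the node reboot and the wait for its OSDs to come back up. The try/finally is the point of the design: forgetting to unset the flag would leave the cluster unable to mark any OSD out, so a real failure later would never trigger re-replication.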

> osd should go down in around 30s max. (in this time, the cluster will be stale)..
> but not 5min.

My experience says otherwise. And if the relevant parameter is 'mon osd
report timeout', the docs also say '300'.
That seems a totally unreasonable value to me...

> (in ceph kraken, they have done optimisation for this detection
>  https://github.com/ceph/ceph/pull/8558)

Interesting. This is not my case anyway, because I've rebooted all the

