[PVE-User] Again, Ceph: default timeout for osd?

Fri Dec 16 15:22:40 CET 2016

Marco,

I think you've answered the nodown/noout question.

As for the "total unreasonable value" for the default: in my
experience, defaults become the defaults in one of two ways.
Either the upstream vendor (Ceph in this case) selects a default
based on their expected typical use case, and the downstream vendor
doesn't override it,
 OR
the downstream vendor changes it to match their own expected use case.

Either way, as with most things in the Unix world, the defaults
aren't for everyone, which is why you can tune them. If the defaults
aren't suitable for you, feel free to change them in your environment.
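
To make the two timers from the quoted discussion below concrete, here is
a minimal sketch in Python. It is only a simplified model, not anything
Ceph itself provides: the function name osd_timeline is hypothetical, it
assumes 'mon osd report timeout' is the only thing that marks an OSD down
(ignoring any other failure-detection path), and it assumes 'nodown' and
'noout' simply suppress the respective step. Following the definition
quoted below, the down/out interval is counted from the moment the OSD is
marked down.

```python
from typing import Optional, Tuple


def osd_timeline(
    report_timeout: int,
    down_out_interval: int,
    nodown: bool = False,
    noout: bool = False,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (down_at, out_at) in seconds after an OSD stops responding.

    None means the event never happens in this simplified model.
    Hypothetical helper for illustration; not part of any Ceph API.
    """
    # Assumption: with 'nodown' set, the OSD is never marked down.
    down_at = None if nodown else report_timeout

    # The down -> out timer only starts once the OSD is down.
    # Assumption: 'noout' keeps a down OSD from ever being marked out.
    if down_at is None or noout:
        out_at = None
    else:
        out_at = down_at + down_out_interval

    return down_at, out_at


if __name__ == "__main__":
    # Values taken from Marco's example: 15 s and 300 s.
    print(osd_timeline(15, 300))               # (15, 315)
    print(osd_timeline(15, 300, noout=True))   # (15, None)
    print(osd_timeline(15, 300, nodown=True))  # (None, None)
```

With the example values, the OSD is marked down at 15 seconds and out at
315 seconds (15 + 300). Setting 'noout' leaves the down step alone and
only stops the rebalancing step, which matches the conclusion Marco
reaches below.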

On Fri, Dec 16, 2016 at 6:47 AM, Marco Gaiarin <gaio at sv.lnf.it> wrote:
> Hello! Alexandre DERUMIER
>   On that day it was said...
>
>> >>mon osd down out interval
>> This is the time between when a monitor marks an OSD "down" (not
>> currently serving data) and "out" (not considered *responsible* for
>> data by the cluster). IO will resume once the OSD is down (assuming
>> the PG has its minimum number of live replicas); it's just that data
>> will be re-replicated to other nodes once an OSD is marked "out".
>
> Seems clear to me. Let me work through an example to be sure.
>
> If I set:
>         mon osd report timeout = 15
>         mon osd down out interval = 300
>
> this happens:
>
> a) after 15 seconds, the unresponsive OSD gets marked 'down', so IO
>  resumes
>
> b) after 5 minutes, the OSD gets marked 'out', and so rebalancing
>  starts.
>
> I still have a doubt. If I run 'ceph osd set nodown', do I simply set
> the first timeout to 'never'? Given the explanation above, that could
> be... and so it is my fault for having set 'nodown'...
>
> Wait...
>         http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002438.html
>
> OK, it seems that the 'noout' flag is the right thing to use. 'nodown'
> should be used only in a 'bouncing' situation.
>
> If I simply need to stop rebalancing, it suffices to set 'noout'.
>
>
>> An OSD should go down in around 30s max (during this time, the cluster
>> will be stale), but not 5 min.
>
> My experience says otherwise. And if the parameter is 'mon osd report
> timeout', the docs also say '300'.
> Seems to me a total unreasonable value...
>
>
>> (in Ceph Kraken, they have optimised this detection:
>>  https://github.com/ceph/ceph/pull/8558)
>
> Interesting. This does not apply to my case, anyway, because I
> rebooted all the servers.
>
> --
> dott. Marco Gaiarin                                     GNUPG Key ID: 240A3D66
>   Associazione ``La Nostra Famiglia''          http://www.lanostrafamiglia.it/
>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
>   marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

-- 
Jeff Palmer
https://PalmerIT.net