[PVE-User] Again, Ceph: default timeout for osd?

Alexandre DERUMIER aderumier at odiso.com
Fri Dec 16 10:47:12 CET 2016


>>mon osd down out interval

This is the time between when a monitor marks an OSD "down" (not
currently serving data) and "out" (not considered *responsible* for
data by the cluster). IO will resume once the OSD is down (assuming
the PG has its minimum number of live replicas); it's just that data
will be re-replicated to other nodes once an OSD is marked "out".
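
To make "minimum number of live replicas" concrete: that is the pool's min_size. A minimal ceph.conf-style sketch, assuming the usual 3-replica pool defaults (the exact values depend on how the pools were created):

# Replicated pools keep serving IO as long as at least "min size" replicas
# of a PG are available; below that, IO to the PG blocks until recovery.
;osd pool default size = 3       # object replicas per pool
;osd pool default min size = 2   # minimum live replicas needed to accept IO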



The OSD should be marked down in around 30 s at most (during that window the affected IO will stall),
but not 5 minutes.

I think this is tunable, but I don't remember which option controls it.

(In Ceph Kraken this failure detection has been optimised:
 https://github.com/ceph/ceph/pull/8558)
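
For what it's worth, the knobs involved are most likely the OSD heartbeat settings; a ceph.conf-style sketch, assuming the Jewel defaults from the mon-osd-interaction docs (not verified on this cluster):

# OSDs ping their peers every "osd heartbeat interval" seconds; a peer that
# does not answer within "osd heartbeat grace" seconds is reported down to
# the monitors (which also want reports from more than one OSD, see
# "mon osd min down reporters").
;osd heartbeat interval = 6   # seconds between peer-to-peer pings
;osd heartbeat grace = 20     # seconds without a reply before reporting down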



----- Original message -----
From: "Marco Gaiarin" <gaio at sv.lnf.it>
To: "proxmoxve" <pve-user at pve.proxmox.com>
Sent: Friday, 16 December 2016 10:14:45
Subject: [PVE-User] Again, Ceph: default timeout for osd?

I did some tests on this in the past, but probably without noticing the problem,
because the test system was... a test system, and thus mostly unloaded.

Yesterday I had to reboot a Ceph node that hosted a MON and some OSDs.

I've set the flags: 

2016-12-15 17:01:29.139923 mon.0 10.27.251.7:6789/0 1213541 : cluster [INF] HEALTH_WARN; nodown,noout flag(s) set 

and then I rebooted the node. Immediately a MON election started:

2016-12-15 17:02:55.923980 mon.3 10.27.251.11:6789/0 861 : cluster [INF] mon.2 calling new monitor election 
2016-12-15 17:02:55.924373 mon.4 10.27.251.12:6789/0 932 : cluster [INF] mon.3 calling new monitor election 
2016-12-15 17:02:55.935396 mon.2 10.27.251.9:6789/0 767 : cluster [INF] mon.4 calling new monitor election 
2016-12-15 17:02:55.937804 mon.1 10.27.251.8:6789/0 1037 : cluster [INF] mon.1 calling new monitor election 
2016-12-15 17:03:00.963259 mon.1 10.27.251.8:6789/0 1038 : cluster [INF] mon.1 at 1 won leader election with quorum 1,2,3,4 
2016-12-15 17:03:00.974493 mon.1 10.27.251.8:6789/0 1039 : cluster [INF] HEALTH_WARN; nodown,noout flag(s) set; 1 mons down, quorum 1,2,3,4 1,4,2,3 
2016-12-15 17:03:00.993133 mon.1 10.27.251.8:6789/0 1040 : cluster [INF] monmap e5: 5 mons at {0=10.27.251.7:6789/0,1=10.27.251.8:6789/0,2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0} 
2016-12-15 17:03:00.993751 mon.1 10.27.251.8:6789/0 1042 : cluster [INF] mdsmap e1: 0/0/0 up 
2016-12-15 17:03:00.994296 mon.1 10.27.251.8:6789/0 1043 : cluster [INF] osdmap e1457: 10 osds: 10 up, 10 in 

but after that, log lines like these started to appear:

2016-12-15 17:03:19.951569 osd.8 10.27.251.9:6808/2444 77 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.707671 secs 
2016-12-15 17:03:19.951577 osd.8 10.27.251.9:6808/2444 78 : cluster [WRN] slow request 30.707671 seconds old, received at 2016-12-15 17:02:49.243826: osd_op(client.7866004.0:21967625 rbd_data.21daf62ae8944a.0000000000000e0e [set-alloc-hint object_size 4194304 write_size 4194304,write 2002944~4096] 1.1a150784 ack+ondisk+write+known_if_redirected e1457) currently waiting for subops from 1 
2016-12-15 17:03:19.951582 osd.8 10.27.251.9:6808/2444 79 : cluster [WRN] slow request 30.295238 seconds old, received at 2016-12-15 17:02:49.656259: osd_op(client.7865380.0:25563538 rbd_data.4384f22ae8944a.0000000000004347 [set-alloc-hint object_size 4194304 write_size 4194304,write 1953792~4096] 1.c2cdeca ack+ondisk+write+known_if_redirected e1457) currently waiting for subops from 0 
2016-12-15 17:03:21.415662 mon.1 10.27.251.8:6789/0 1053 : cluster [INF] pgmap v3604380: 768 pgs: 768 active+clean; 984 GB data, 1964 GB used, 12932 GB / 14896 GB avail; 1336 B/s wr, 0 op/s 

until the node came back.
While the server was rebooting, the VMs became unresponsive, with their load
going sky high.
Initially I had NOT noticed that, although the MON election started
immediately, the OSDs were never marked down/out.


So, after reading docs and logs, I've understood that:

1) clearly, Ceph cannot mark an OSD down in milliseconds, so if an OSD
goes down it is normal that IO stalls until the cluster recognises that
the OSD is down and redirects access elsewhere.

2) by setting the 'nodown,noout' flags I ''lock'' not Ceph's ability to
recognise an OSD as down/out, but the effects of that recognition (e.g.
rebalancing).

3) the default timeouts for marking an OSD down/out are, to me, absolutely
far from reasonable values. The example config
(/usr/share/doc/ceph/sample.ceph.conf.gz) says:
# The number of seconds Ceph waits before marking a Ceph OSD 
# Daemon "down" and "out" if it doesn't respond. 
# Type: 32-bit Integer 
# (Default: 300) 
;mon osd down out interval = 300 

# The grace period in seconds before declaring unresponsive Ceph OSD 
# Daemons "down". 
# Type: 32-bit Integer 
# (Default: 900) 
;mon osd report timeout = 300 

(the same values appear in
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/; a sketch of overriding them follows below).
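
For instance, if I lowered them in /etc/ceph/ceph.conf it would be something like this (just a sketch with guessed values; whether such values are sane is exactly what I'm asking):

[global]
# hypothetical overrides, not applied on my cluster:
mon osd down out interval = 60   # seconds from "down" to "out" (re-replication)
mon osd report timeout = 120     # grace before an unresponsive OSD is declared "down"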



So, with the best settings, I have to wait 5 minutes to get an OSD really
marked down. In 5 minutes the server comes back from a reboot, so I have
never actually had an OSD marked down... but with all the VMs stalled!


This seems to me a totally unreasonable ''timeout''. I think a reasonable
value could be 5-15 seconds, but I'm confused, so I'm seeking feedback.


Thanks. 

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66 
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/ 
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN) 
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797 
