[PVE-User] Again, Ceph: default timeout for osd?

Fri Dec 16 10:14:45 CET 2016

I ran some tests in the past, but probably did not notice this
because the test system was... a test system, so mostly unloaded.

Yesterday I had to reboot a Ceph node that was a MON and also hosted
some OSDs.

I've set the flags:

 2016-12-15 17:01:29.139923 mon.0 10.27.251.7:6789/0 1213541 : cluster [INF] HEALTH_WARN; nodown,noout flag(s) set

and then I rebooted the node. A mon election started immediately:

 2016-12-15 17:02:55.923980 mon.3 10.27.251.11:6789/0 861 : cluster [INF] mon.2 calling new monitor election
 2016-12-15 17:02:55.924373 mon.4 10.27.251.12:6789/0 932 : cluster [INF] mon.3 calling new monitor election
 2016-12-15 17:02:55.935396 mon.2 10.27.251.9:6789/0 767 : cluster [INF] mon.4 calling new monitor election
 2016-12-15 17:02:55.937804 mon.1 10.27.251.8:6789/0 1037 : cluster [INF] mon.1 calling new monitor election
 2016-12-15 17:03:00.963259 mon.1 10.27.251.8:6789/0 1038 : cluster [INF] mon.1 at 1 won leader election with quorum 1,2,3,4
 2016-12-15 17:03:00.974493 mon.1 10.27.251.8:6789/0 1039 : cluster [INF] HEALTH_WARN; nodown,noout flag(s) set; 1 mons down, quorum 1,2,3,4 1,4,2,3
 2016-12-15 17:03:00.993133 mon.1 10.27.251.8:6789/0 1040 : cluster [INF] monmap e5: 5 mons at {0=10.27.251.7:6789/0,1=10.27.251.8:6789/0,2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0}
 2016-12-15 17:03:00.993751 mon.1 10.27.251.8:6789/0 1042 : cluster [INF] mdsmap e1: 0/0/0 up
 2016-12-15 17:03:00.994296 mon.1 10.27.251.8:6789/0 1043 : cluster [INF] osdmap e1457: 10 osds: 10 up, 10 in

but after that, rows like these started to be logged:

 2016-12-15 17:03:19.951569 osd.8 10.27.251.9:6808/2444 77 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.707671 secs
 2016-12-15 17:03:19.951577 osd.8 10.27.251.9:6808/2444 78 : cluster [WRN] slow request 30.707671 seconds old, received at 2016-12-15 17:02:49.243826: osd_op(client.7866004.0:21967625 rbd_data.21daf62ae8944a.0000000000000e0e [set-alloc-hint object_size 4194304 write_size 4194304,write 2002944~4096] 1.1a150784 ack+ondisk+write+known_if_redirected e1457) currently waiting for subops from 1
 2016-12-15 17:03:19.951582 osd.8 10.27.251.9:6808/2444 79 : cluster [WRN] slow request 30.295238 seconds old, received at 2016-12-15 17:02:49.656259: osd_op(client.7865380.0:25563538 rbd_data.4384f22ae8944a.0000000000004347 [set-alloc-hint object_size 4194304 write_size 4194304,write 1953792~4096] 1.c2cdeca ack+ondisk+write+known_if_redirected e1457) currently waiting for subops from 0
 2016-12-15 17:03:21.415662 mon.1 10.27.251.8:6789/0 1053 : cluster [INF] pgmap v3604380: 768 pgs: 768 active+clean; 984 GB data, 1964 GB used, 12932 GB / 14896 GB avail; 1336 B/s wr, 0 op/s

until the node came back.
While the server was rebooting, the VMs became unresponsive and their
load went sky high.
Initially I had NOT noticed that the mon election started immediately
while the OSDs were not marked out/down.
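
To make the timing in the rows above easier to read, here is a minimal
Python sketch that parses one of the quoted "slow request" rows. The
regular expression and the function name parse_slow_request are
hypothetical, written only against the rows quoted in this mail, and I
am assuming that the number after "waiting for subops from" is the id
of the replica OSD being waited on.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern, written only against the rows quoted in this mail.
SLOW_RE = re.compile(
    r"^\s*(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<osd>osd\.\d+) .*?slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+):"
    r".*currently waiting for subops from (?P<peers>[\d,]+)\s*$"
)
TS = "%Y-%m-%d %H:%M:%S.%f"

# One of the rows quoted above, split only to keep the source readable.
SAMPLE = (
    " 2016-12-15 17:03:19.951577 osd.8 10.27.251.9:6808/2444 78 : cluster "
    "[WRN] slow request 30.707671 seconds old, received at "
    "2016-12-15 17:02:49.243826: osd_op(client.7866004.0:21967625 "
    "rbd_data.21daf62ae8944a.0000000000000e0e [set-alloc-hint object_size "
    "4194304 write_size 4194304,write 2002944~4096] 1.1a150784 "
    "ack+ondisk+write+known_if_redirected e1457) currently waiting for "
    "subops from 1"
)

# Timestamp of the first "calling new monitor election" row quoted above.
ELECTION = datetime.strptime("2016-12-15 17:02:55.923980", TS)


def parse_slow_request(line):
    """Return the fields of a 'slow request' row, or None if it does not match."""
    m = SLOW_RE.match(line)
    if m is None:
        return None
    return {
        "osd": m["osd"],
        "logged": datetime.strptime(m["logged"], TS),
        "received": datetime.strptime(m["received"], TS),
        "age": timedelta(seconds=float(m["age"])),
        # Assumption: these are the ids of the OSDs being waited on.
        "waiting_for": [int(p) for p in m["peers"].split(",")],
    }


if __name__ == "__main__":
    row = parse_slow_request(SAMPLE)
    print(row["osd"], "waiting for OSD id(s)", row["waiting_for"])
    print("blocked for", row["age"].total_seconds(), "seconds")
    print("received", (ELECTION - row["received"]).total_seconds(),
          "seconds before the first election row")
```

For the row used here, the request was received at 17:02:49.243826,
about 6.7 seconds before the first election row (17:02:55.923980), so
the writes were already blocked before the monitors reacted.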

So, after reading the docs and logs, I have understood that:

1) clearly, Ceph cannot mark an OSD down in milliseconds, so if an OSD
 goes down, it is normal that I/O stalls until the system recognizes
 that the OSD is down and redirects access elsewhere.

2) by setting the 'nodown,noout' flags I ''lock'' not the ability of
 Ceph to recognize an OSD as down/out, but the effect of that (e.g.,
 rebalancing).

3) the default timeouts for marking an OSD out/down are, to me,
 absolutely far from reasonable values. The example config
 (/usr/share/doc/ceph/sample.ceph.conf.gz) says:
    # The number of seconds Ceph waits before marking a Ceph OSD
    # Daemon "down" and "out" if it doesn't respond.
    # Type: 32-bit Integer
    # (Default: 300)
    ;mon osd down out interval  = 300

    # The grace period in seconds before declaring unresponsive Ceph OSD
    # Daemons "down".
    # Type: 32-bit Integer
    # (Default: 900)
    ;mon osd report timeout          = 300

(the same in
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/).
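
As a way to check which values a given config would yield, here is a
minimal Python sketch that reads the two options from an INI-style
fragment and falls back to the defaults stated in the sample's
"(Default: ...)" comments when an option is commented out. The names
SAMPLE_DEFAULTS and effective_timeouts and the fragment in CONF are
hypothetical, and the sketch is a simplification: it uses Python's
configparser, not Ceph's own parser, so for example it does not treat
"mon_osd_report_timeout" and "mon osd report timeout" as the same
option. Note also that the quoted sample gives a default of 900 for
"mon osd report timeout" while its commented-out line reads 300.

```python
import configparser

# Defaults as stated in the "(Default: ...)" comments of the sample quoted
# above; these are assumptions to verify against the running cluster.
SAMPLE_DEFAULTS = {
    "mon osd down out interval": 300,
    "mon osd report timeout": 900,
}

# Hypothetical fragment for illustration: one option left commented out
# (so the default applies) and one set explicitly.
CONF = """\
[global]
;mon osd down out interval = 300
mon osd report timeout = 300
"""


def effective_timeouts(conf_text, section="global"):
    """Return both timeouts in seconds, using the sample defaults when unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return {
        option: parser.getint(section, option, fallback=default)
        for option, default in SAMPLE_DEFAULTS.items()
    }


if __name__ == "__main__":
    for option, seconds in effective_timeouts(CONF).items():
        print(f"{option} = {seconds} s")
```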

So I have to wait 5 minutes, in the best case, for an OSD to really be
marked down. Within 5 minutes the server is back from its reboot, so I
never had an OSD down... but all the VMs were stalled!

This seems to me a totally unreasonable ''timeout''. I think a
reasonable value could be 5-15 seconds, but I'm confused, so I'm
seeking feedback.

Thanks.

-- 
dott. Marco Gaiarin				        GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''          http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797
