Marco Gaiarin gaio at sv.lnf.it
Mon Nov 28 13:05:11 CET 2016

A very strange Saturday evening. Hardware tooling, hacking, caffeine,

I'm still completing my Ceph storage cluster (2 storage nodes for now,
waiting to add the third), but it is mostly ''in production''.
So, after playing with the servers for some months, on Saturday I shut
down the whole cluster and set up all the cables, switches, UPS, ... in
a more decent and stable way.

To simulate a hard power outage, I did not set the noout and nodown
flags.

After that, I powered up the whole cluster (first the 2 Ceph storage
nodes, then the 2 PVE host nodes) and hit the first trouble:

	2016-11-26 18:17:29.901353 mon.0 1218 : cluster [INF] HEALTH_WARN; clock skew detected on mon.1, mon.2; 1 mons down, quorum 0,1,2 0,1,2; Monitor clock skew detected 

The trouble came from the fact that... my NTP server was on a VM, and
although the status was only 'HEALTH_WARN', I could no longer access
the storage.
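
For anyone wanting to check such a health line from a script, here is a
minimal sketch in Python. It only assumes the log format quoted above;
the function name skewed_monitors is made up for illustration and is not
part of any Ceph tool.

```python
import re

# Matches the monitor list in a health line such as
# "... clock skew detected on mon.1, mon.2; 1 mons down ..."
SKEW_RE = re.compile(r"clock skew detected on ([^;]+)")


def skewed_monitors(health_line):
    """Return the monitor names reported with clock skew, or [] if none."""
    match = SKEW_RE.search(health_line)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",")]


# The health line quoted above, split only to keep the source short.
line = ("2016-11-26 18:17:29.901353 mon.0 1218 : cluster [INF] HEALTH_WARN; "
        "clock skew detected on mon.1, mon.2; 1 mons down, quorum 0,1,2 0,1,2; "
        "Monitor clock skew detected ")
print(skewed_monitors(line))
```

On a live cluster the same information should be visible with
'ceph health detail'; the parsing above is only a convenience for logs
that were saved as text.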

I solved this by adding more NTP servers from other sites, and after
some time the cluster went OK:

	2016-11-26 19:11:33.343818 mon.0 1581 : cluster [INF] HEALTH_OK

and here the panic starts.

The PVE interface reported the Ceph cluster as OK and correctly showed
everything (mon, osd, pools, pool usage, ...), but the data on the
cluster was not accessible:

 a) if I tried to move a disk, it replied with something like 'no available'.

 b) if I tried to start VMs, they stalled...

The only strange thing in the logs was that there were NO pgmap updates, like

	2016-11-26 16:59:31.588695 mon.0 2317560 : cluster [INF] pgmap v2410540: 768 pgs: 768 active+clean; 936 GB data, 1858 GB used, 7452 GB / 9310 GB avail; 13569 kB/s rd, 2731 kB/s wr, 565 op/s

but really, in the panic, I did not notice that.
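
A missing pgmap update is easy to overlook by eye, so a small check can
help. The following is a rough sketch in Python, assuming the cluster
log is available as text lines in the format quoted above; the function
name last_pgmap_update is hypothetical.

```python
from datetime import datetime


def last_pgmap_update(log_lines):
    """Return the timestamp of the newest 'pgmap v...' line, or None."""
    newest = None
    for line in log_lines:
        if " pgmap v" not in line:
            continue
        # The first two fields are the date and the time with microseconds.
        date_part, time_part = line.split()[:2]
        stamp = datetime.strptime(date_part + " " + time_part,
                                  "%Y-%m-%d %H:%M:%S.%f")
        if newest is None or stamp > newest:
            newest = stamp
    return newest


# Two lines taken from this post (the pgmap line is shortened here).
sample = [
    "\t2016-11-26 16:59:31.588695 mon.0 2317560 : cluster [INF] "
    "pgmap v2410540: 768 pgs: 768 active+clean",
    "\t2016-11-26 19:11:33.343818 mon.0 1581 : cluster [INF] HEALTH_OK",
]
print(last_pgmap_update(sample))
```

Comparing the returned timestamp with the current time (or with the
newest line in the log) shows how long the cluster has gone without a
pgmap update.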

After some tests, I finally did the right thing:

 1) I set the noout and nodown flags.

 2) I rebooted the Ceph nodes, one by one.
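
The two steps above could be scripted roughly as follows. This is only a
sketch in Python: it wraps the standard 'ceph osd set' and
'ceph osd unset' commands, defaults to a dry run that just prints them,
and the final unset step is my assumption (it is not something described
above). The helper names flag_commands and apply_flags are made up.

```python
import subprocess

# The two flags mentioned in step 1.
FLAGS = ("noout", "nodown")


def flag_commands(action):
    """Build the 'ceph osd set|unset <flag>' command lines for both flags."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in FLAGS]


def apply_flags(action, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in flag_commands(action):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


apply_flags("set")    # before rebooting the Ceph nodes one by one
apply_flags("unset")  # assumption: clear the flags once all OSDs are back
```

Call apply_flags("set", dry_run=False) on a node with admin access to
the cluster to really run the commands.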

After that, the whole cluster started. VMs that were stalled, immediately

After that, I understood that NTP is a crucial service for Ceph, so
a pool of servers is needed. Still, I'm not sure this was
the culprit.

The second thing I understood is that Ceph reacts badly to a total
shutdown. In a datacenter this is probably acceptable.

I don't know if it is my fault, or whether there is THE RIGHT WAY to
start a Ceph cluster from cold metal...


