[PVE-User] CEPH: How to remove an OSD without experiencing inactive placement groups
Chris Murray
chrismurray84 at gmail.com
Fri Dec 19 11:31:43 CET 2014
I'm starting to familiarise myself with CEPH, and am very impressed with
how it's been packaged into Proxmox. Very easy to set up and administer,
thank you. This may be a CEPH question at heart, but I'll ask here in
case it's related to the implementation in Proxmox.
I might be misunderstanding in/out/up/down, but what is the correct
procedure for OSD removal?
I have three hosts, each with three OSDs. In addition to the usual three
pools, there's an additional 'vmpool' pool. All four have size=3 and
min_size=1. The disks vary quite a bit in size, and possibly in health.
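For reference, the size/min_size values above are as reported by commands
along the lines of the following (run per pool); apologies if there's a more
canonical way to check:

  ceph osd pool get vmpool size
  ceph osd pool get vmpool min_size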
There's a mapping to 'vmpool' from another Proxmox cluster, upon which
some virtual machines live.
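That mapping is just an RBD storage entry in /etc/pve/storage.cfg on the
other cluster; from memory it looks roughly like this (storage name is
illustrative, with the keyring in /etc/pve/priv/ceph/<name>.keyring):

  rbd: ceph-vm
       monhost 192.168.12.25;192.168.12.26;192.168.12.27
       pool vmpool
       content images
       username admin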
So, the pool works, but I want to remove OSD.0 on the first CEPH node.
I mark the OSD as 'down' and 'out' (although I can't remember which I did
first), a load of I/O starts, and the VMs become unresponsive. They
aren't very busy virtual machines.
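If it matters, what I did was roughly the following (I can't be sure of the
exact order, and some of it may have been via the GUI):

  ceph osd out 0
  service ceph stop osd.0    # which is what shows it as 'down', I believe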
'ceph status' looks as follows. Note the 78 stuck inactive placement
groups.
cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
health HEALTH_WARN 48 pgs backfill; 2 pgs backfilling; 340 pgs
degraded; 20 pgs recovering; 123 pgs recovery_wait; 78 pgs stuck
inactive; 613 pgs stuck unclean; 20 requests are blocked > 32 sec;
recovery 158823/691378 objects degraded (22.972%)
monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
osdmap e572: 9 osds: 8 up, 8 in
pgmap v96969: 1216 pgs, 4 pools, 888 GB data, 223 kobjects
2249 GB used, 7486 GB / 9736 GB avail
158823/691378 objects degraded (22.972%)
8 active+recovering+remapped
78 inactive
72 active+recovery_wait
603 active+clean
2 active+degraded+remapped+backfilling
12 active+recovering
290 active+degraded
52 active+remapped
51 active+recovery_wait+remapped
48 active+degraded+remapped+wait_backfill
recovery io 17591 kB/s, 4 objects/s
I leave this overnight and find that the same 78 inactive PGs remain once
the recovery has apparently finished.
cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
health HEALTH_WARN 290 pgs degraded; 78 pgs stuck inactive; 496 pgs
stuck unclean; 4 requests are blocked > 32 sec; recovery 69696/685356
objects degraded (10.169%)
monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
osdmap e669: 9 osds: 8 up, 8 in
pgmap v100175: 1216 pgs, 4 pools, 888 GB data, 223 kobjects
2408 GB used, 7327 GB / 9736 GB avail
69696/685356 objects degraded (10.169%)
78 inactive
720 active+clean
290 active+degraded
128 active+remapped
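In case it helps with diagnosis, I can also post output from the likes of:

  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg <pgid> query       # for one of the stuck PGs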
I started the OSD to bring it back 'up'. It's still 'out'.
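(Started from the node's shell with something like the following; I assume
the GUI 'Start' button amounts to the same thing.)

  service ceph start osd.0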
cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
health HEALTH_WARN 59 pgs degraded; 496 pgs stuck unclean; recovery
30513/688554 objects degraded (4.431%)
monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
osdmap e671: 9 osds: 9 up, 8 in
pgmap v103181: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
2408 GB used, 7327 GB / 9736 GB avail
30513/688554 objects degraded (4.431%)
720 active+clean
59 active+degraded
437 active+remapped
client io 2303 kB/s rd, 153 kB/s wr, 85 op/s
No pgs marked inactive now. I stop the OSD. It's now 'down' and 'out'
again, as it was earlier. At this point, I start my virtual machines
again, which now function.
cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
health HEALTH_WARN 368 pgs degraded; 496 pgs stuck unclean;
recovery 83332/688554 objects degraded (12.102%)
monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
osdmap e673: 9 osds: 8 up, 8 in
pgmap v103248: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
2408 GB used, 7327 GB / 9736 GB avail
83332/688554 objects degraded (12.102%)
720 active+clean
368 active+degraded
128 active+remapped
client io 19845 B/s wr, 6 op/s
I then remove the OSD, and data starts moving around, as I'd expect.
The VMs are slow, but they're working, which is good. :-)
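('Remove' here being the final removal from the cluster and CRUSH map; as
far as I understand it, the underlying Ceph commands are along these lines:)

  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0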
cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
health HEALTH_WARN 35 pgs backfill; 8 pgs backfilling; 43 pgs
degraded; 17 pgs recovering; 122 pgs recovery_wait; 631 pgs stuck
unclean; 1 requests are blocked > 32 sec; recovery 295039/709243 objects
degraded (41.599%)
monmap e3: 3 mons at
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
election epoch 50, quorum 0,1,2 0,1,2
osdmap e690: 8 osds: 8 up, 8 in
pgmap v103723: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
2412 GB used, 7323 GB / 9736 GB avail
295039/709243 objects degraded (41.599%)
401 active
122 active+recovery_wait
13 active+degraded+remapped
567 active+clean
11 active+remapped+wait_backfill
6 active+degraded+remapped+backfilling
17 active+recovering
53 active+remapped
24 active+degraded+remapped+wait_backfill
2 active+remapped+backfilling
recovery io 197 MB/s, 49 objects/s
client io 7721 B/s wr, 2 op/s
--------
My question is: what is the correct procedure for removing an OSD, and
why would the actions above have rendered placement groups temporarily
'blocked' (for want of a better word) when other replicas of the data
were available in the pool (and must have been, for the process to
ultimately complete)? What if the same sequence of events happened
during an actual failure and it was not possible to bring the OSD back
'up' first, e.g. a disk failure followed by an entire host failure?
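For what it's worth, the order I've pieced together from the Ceph
documentation is to mark the OSD out while it is still up, wait for the
data to migrate off and the cluster to return to active+clean, and only
then stop the daemon and remove it:

  ceph osd out 0               # data drains while the OSD can still serve its replicas
  ceph status                  # watch/repeat until recovery has finished
  service ceph stop osd.0
  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0

Is that also the intended sequence under Proxmox?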
I understand this is an emerging technology under active development; I
just want to check that I'm not missing anything obvious and haven't
fundamentally misunderstood how it works. I didn't expect the loss of
one of nine devices in the pool to halt I/O, especially when every object
exists three times.
Thanks in advance,
Chris