[PVE-User] CEPH: How to remove an OSD without experiencing inactive placement groups

Chris Murray chrismurray84 at gmail.com
Fri Dec 19 19:46:49 CET 2014


Hi Eneko,

Thank you. From what I'm reading, those options affect the amount of concurrent recovery that happens. Forgive my ignorance, but how do they address the 78 placement groups which were inactive from the beginning of the process right through to the end?

A Google search for the following doesn't turn up much:
"stuck inactive" "osd max backfills" "osd recovery max active"

I don't understand why these placement groups would stay 'stuck inactive' until I brought the OSD up again. If it were a case of a lack of IO capacity in the pool getting in the way of recovery (which I can understand with only nine disks), why were there 78 PGs inactive from the beginning, and then (presumably the same) 78 at the end? In that situation I would expect the VMs to be slow, and that at the end of the process, or part-way through once the IO had subsided, CEPH would move them back into one of the active states.

I'm not familiar with the inner workings of CEPH, and they are probably complex enough to go over my head anyway; I'm just trying to understand roughly what it has chosen to do there and why. I can see why those tunables might improve responsiveness during the recovery process, though.
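For anyone following along, something like the commands below should show which placement groups are stuck inactive and let you query an individual one for the reason it isn't going active (treat this as a sketch; the exact output varies between Ceph releases):

      ceph health detail
      ceph pg dump_stuck inactive
      ceph pg <pgid> query

The query output includes a 'recovery_state' section which usually says what the PG is waiting on. The <pgid> placeholder is one of the IDs reported by the previous commands, e.g. something like 2.1a.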

Thanks,
Chris 

-----Original Message-----
From: pve-user [mailto:pve-user-bounces at pve.proxmox.com] On Behalf Of Eneko Lacunza
Sent: 19 December 2014 11:56
To: pve-user at pve.proxmox.com
Subject: Re: [PVE-User] CEPH: How to remove an OSD without experiencing inactive placement groups

Hi Chris,

The problem you reported is quite common in small Ceph clusters.

I suggest tuning the following in /etc/pve/ceph.conf, in the [osd] section:

      osd max backfills = 1
      osd recovery max active = 1

This should make the recovery "slower" and thus should keep the VMs responsive. Recovery will still be noticeable, though.
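As far as I know, values in ceph.conf are only picked up when an OSD starts, so to apply them to already-running OSDs you can inject them at runtime, something along these lines (untested here, so please check the Ceph documentation for your release):

      ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'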

Cheers
Eneko

On 19/12/14 12:36, Chris Murray wrote:
> That would make sense. Thank you Dietmar, I'll give them a try.
>
> Sent from my HTC
>
> ----- Reply message -----
> From: "Dietmar Maurer" <dietmar at proxmox.com>
> To: "pve-user at pve.proxmox.com" <pve-user at pve.proxmox.com>, "Chris 
> Murray" <chrismurray84 at gmail.com>
> Subject: [PVE-User] CEPH: How to remove an OSD without experiencing 
> inactive placement groups
> Date: Fri, Dec 19, 2014 10:48
>
>
>> I understand this is an emerging technology under active development;
>> I just want to check that I'm not missing anything obvious or haven't
>> fundamentally misunderstood how it works. I didn't expect the loss of
>> 1/9 of the devices in the pool to halt IO, especially when every
>> object exists three times.
> This looks like a CRUSH-related problem to me. CRUSH maps sometimes
> have problems with small setups (also see CRUSH tunables). But I
> suggest asking that on the ceph list.
>
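Regarding the CRUSH tunables Dietmar mentions above: you can inspect what the cluster is currently using, and decompile the CRUSH map for a closer look, with something like the following (again just a sketch, and the file names here are arbitrary):

      ceph osd crush show-tunables
      ceph osd getcrushmap -o crushmap.bin
      crushtool -d crushmap.bin -o crushmap.txt

Be careful with actually changing the tunables (e.g. 'ceph osd crush tunables optimal'), as that can trigger a lot of data movement on its own.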


--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
       943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user



