[PVE-User] CEPH: How to remove an OSD without experiencing inactive placement groups

Tue Dec 30 11:45:52 CET 2014

Hi All,

Sorry for the delay in getting back; I've been on holiday, and yesterday 
I was too busy to catch up with the list.

On 19/12/14 19:54, Adam Thompson wrote:
> On 14-12-19 12:46 PM, Chris Murray wrote:
>>
>> Thank you. From what I'm reading, those options affect the amount of 
>> concurrent recovery that is happening. Forgive my ignorance, but how 
>> does it address the 78 placement groups which were inactive from the 
>> beginning of the process and past the end of the process?
>>
>> My google for the following doesn't turn up much:
>> "stuck inactive" "osd max backfills" "osd recovery max active"
>>
>> I don't understand why these would become 'stuck inactive' until I 
>> brought the OSD up again. If it were a case of the lack of IO in the 
>> pool getting in the way of recovery (which I can understand with only 
>> nine disks), why were there 78 pgs inactive from the beginning, then 
>> (presumably the same) 78 at the end? I might expect in that situation 
>> that the VMs would be slow, and at the end of the process or part-way 
>> through when the IO has subsided, CEPH would decide that they become 
>> one of the active states again. I'm not familiar with the inner 
>> workings of CEPH and they are probably complex enough to just go over 
>> my head anyway; just trying to understand roughly what it's chosen to 
>> do there and why. I can see why those tunables might improve the 
>> responsiveness during the recovery process though.
>
> AFAIK you're exactly right about those settings.
> What I found was the only way to work around it was to adjust the 
> "size" and "min_size" pool options to "1" before removing the OSD, 
> then set them back to whatever you wanted after OSD removal.
> I think what's happening is that CEPH is noticing that there are a 
> bunch of pages that, while replicated elsewhere, are still valid, that 
> are now offline... not 100% sure.
>
> I wish sheepdog would hurry up and mature, it's much less complicated 
> for small-scale situations (1<n<32 hosts) like you and I are running.  
> After ignoring multiple warnings from Proxmox staff, I configured 
> sheepdog, saw fantastic performance (esp. compared to CEPH) and ... 
> promptly got burned when the next update changed the metadata format 
> with *no* in-place upgrade option.  (But until then it was awesome.)
>
> CEPH is a solid option, and I'm glad PVE includes it, but it's very 
> big and complex and cumbersome for low disk-count, low host-count 
> setups.  (E.g. I have 4 hosts, with 2 OSDs each.  CEPH isn't really 
> designed to scale down that small, at least not very well.)
>
I have experienced similar problems: with a pool at size=3, after 
changing it to size=2, Ceph would not show HEALTH_OK. I also had to 
change it to size=1 and then back to size=2 to get HEALTH_OK.
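
For reference, the round trip described above can be sketched roughly as 
follows. This is only a minimal Python sketch that builds the 
"ceph osd pool set" command lines and prints them; it does not touch a 
cluster. The pool name "rbd", the helper name and the ordering (lowering 
min_size before size) are my own assumptions for illustration, not 
something confirmed in this thread.

```python
from typing import List


def size_toggle_commands(pool: str, target_size: int,
                         target_min_size: int) -> List[List[str]]:
    """Build (not run) the `ceph osd pool set` calls for a size=1 round trip.

    Hypothetical helper, for illustration only.
    """
    if target_size < 1 or not 1 <= target_min_size <= target_size:
        raise ValueError("need 1 <= min_size <= size")
    base = ["ceph", "osd", "pool", "set", pool]
    return [
        # Assumed ordering: lower min_size first so size never ends up below it.
        base + ["min_size", "1"],
        base + ["size", "1"],
        # Restore the replica count, then the minimum needed to serve I/O.
        base + ["size", str(target_size)],
        base + ["min_size", str(target_min_size)],
    ]


if __name__ == "__main__":
    # "rbd" is a placeholder pool name; substitute your own.
    for cmd in size_toggle_commands("rbd", 2, 1):
        print(" ".join(cmd))
```

Keep in mind that while the pool is at size=1 there is only a single 
copy of each object, so I would only do this for as short a time as 
possible.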

I think that, because Ceph is developed and tested for much larger 
setups (in nodes and disks), on small setups we're hitting some rough 
edges and corner cases. :(

Anyway, I like what I've seen so far, and the integration in Proxmox is 
also very convenient (I haven't checked sheepdog/glusterfs).

Cheers
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
       943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es