[PVE-User] Spillover issue
Alwin Antreich
a.antreich at proxmox.com
Tue Mar 24 12:24:23 CET 2020
Hello Eneko,
On Tue, Mar 24, 2020 at 10:34:15AM +0100, Eneko Lacunza wrote:
> Hi all,
>
> We're seeing a spillover issue with Ceph, using 14.2.8:
>
> We originally had a 1 GB RocksDB partition:
>
> 1. ceph health detail
> HEALTH_WARN BlueFS spillover detected on 3 OSD
> BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
> osd.3 spilled over 78 MiB metadata from 'db' device (1024 MiB used
> of 1024 MiB) to slow device
> osd.4 spilled over 78 MiB metadata from 'db' device (1024 MiB used
> of 1024 MiB) to slow device
> osd.5 spilled over 84 MiB metadata from 'db' device (1024 MiB used
> of 1024 MiB) to slow device
>
> We created new 6 GiB partitions for RocksDB, copied the original partition,
> then extended it with "ceph-bluestore-tool bluefs-bdev-expand".
> Now we get:
>
> 1. ceph health detail
> HEALTH_WARN BlueFS spillover detected on 3 OSD
> BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
> osd.3 spilled over 5 MiB metadata from 'db' device (555 MiB used of
> 6.0 GiB) to slow device
> osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of
> 6.0 GiB) to slow device
> osd.5 spilled over 5 MiB metadata from 'db' device (561 MiB used of
> 6.0 GiB) to slow device
>
> Issuing "ceph daemon osd.X compact" doesn't help, but shows the following
> transitional state:
>
> 1. ceph daemon osd.5 compact
> {
>     "elapsed_time": 5.4560688339999999
> }
> 2. ceph health detail
> HEALTH_WARN BlueFS spillover detected on 3 OSD
> BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
> osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of
> 6.0 GiB) to slow device
> osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of
> 6.0 GiB) to slow device
> osd.5 spilled over 5 MiB metadata from 'db' device (1.1 GiB used of
> 6.0 GiB) to slow device
> (...and after a while...)
> 3. ceph health detail
> HEALTH_WARN BlueFS spillover detected on 3 OSD
> BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
> osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of
> 6.0 GiB) to slow device
> osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of
> 6.0 GiB) to slow device
> osd.5 spilled over 5 MiB metadata from 'db' device (551 MiB used of
> 6.0 GiB) to slow device
>
> I may be overlooking something, any ideas? I also just found the following
> Ceph issue:
>
> https://tracker.ceph.com/issues/38745
>
> 5 MiB of metadata on the slow device isn't a big problem, but the cluster is
> permanently in a health warning state... :)
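For reference, the expansion you describe would look roughly like the following
per OSD. This is only a sketch: the OSD id and device paths are placeholders,
and depending on how the OSD was created the DB device reference (block.db
symlink / LVM tags) may need to be updated as well.

systemctl stop ceph-osd@5                               # stop the OSD before touching its DB
dd if=/dev/sdX2 of=/dev/sdY1 bs=1M status=progress      # copy the old 1 GiB DB partition onto the new 6 GiB one
chown ceph:ceph /dev/sdY1                               # keep the device usable by the ceph user
ln -sf /dev/sdY1 /var/lib/ceph/osd/ceph-5/block.db      # repoint block.db to the new partition
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-5
systemctl start ceph-osd@5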
The DB/WAL device is too small, so all new metadata has to be written to
the slow device. This will destroy performance.
I think the reported size changes as the DB gets compacted.
The easiest way is to destroy and re-create the OSD with a bigger DB/WAL.
The guideline from Facebook for RocksDB is 3/30/300 GB; with the default
settings, RocksDB only keeps a level on the fast device if the whole level
fits there, so sizes in between bring little benefit.
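On a Proxmox VE 6 node that could look roughly like the sketch below; the OSD
id, device paths and the 30 GB DB size are only examples, and the pveceph
option names should be checked against the locally installed version.

ceph osd out osd.5                              # let the data drain off the OSD
while ! ceph osd safe-to-destroy osd.5; do sleep 60; done
pveceph osd destroy 5 --cleanup                 # remove the OSD and wipe its devices
pveceph osd create /dev/sdX --db_dev /dev/nvme0n1 --db_size 30   # re-create with a 30 GB DB (example devices)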
--
Cheers,
Alwin