[PVE-User] Spillover issue

Alwin Antreich a.antreich at proxmox.com
Wed Mar 25 11:55:12 CET 2020


On Wed, Mar 25, 2020 at 08:43:41AM +0100, Eneko Lacunza wrote:
> Hi Alwin,
> 
> > On 24/03/20 at 14:54, Alwin Antreich wrote:
> > On Tue, Mar 24, 2020 at 01:12:03PM +0100, Eneko Lacunza wrote:
> > > Hi Alwin,
> > > 
> > > On 24/03/20 at 12:24, Alwin Antreich wrote:
> > > > On Tue, Mar 24, 2020 at 10:34:15AM +0100, Eneko Lacunza wrote:
> > > > > We're seeing a spillover issue with Ceph, using 14.2.8:
> > > [...]
> > > > > 3. ceph health detail
> > > > >      HEALTH_WARN BlueFS spillover detected on 3 OSD
> > > > >      BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
> > > > >      osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of
> > > > >      6.0 GiB) to slow device
> > > > >      osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of
> > > > >      6.0 GiB) to slow device
> > > > >      osd.5 spilled over 5 MiB metadata from 'db' device (551 MiB used of
> > > > >      6.0 GiB) to slow device
> > > > > 
> > > > > I may be overlooking something, any ideas? I also just found the
> > > > > following Ceph issue:
> > > > > 
> > > > > https://tracker.ceph.com/issues/38745
> > > > > 
> > > > > 5 MiB of metadata on the slow device isn't a big problem, but the
> > > > > cluster is permanently in HEALTH_WARN state... :)
> > > > The DB/WAL device is too small and all the new metadata has to be written
> > > > to the slow device. This will destroy performance.
> > > > 
> > > > I think the size changes, as the DB gets compacted.
> > > Yes. But it isn't too small... it's 6 GiB and there's only ~560MiB of data.
> > Yes, true. I meant the used size. But the message is odd.
> > 
> > You should find the compaction stats in the OSD log files. It could be,
> > as reasoned in the bug tracker, that the compaction needs too much space
> > and spills over to the slow device. Additionally, if not configured
> > separately, the WAL will take up 512 MB on the DB device.
> I don't see any indication that compaction needs too much space:
> 
> 2020-03-24 14:24:04.883 7f03ffbee700  4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
> 2020-03-24 14:24:04.883 7f03ffbee700  4 rocksdb: [db/db_impl.cc:778]
> ** DB Stats **
> Uptime(secs): 15000.1 total, 600.0 interval
> Cumulative writes: 4646 writes, 18K keys, 4646 commit groups, 1.0 writes per commit group, ingest: 0.01 GB, 0.00 MB/s
> Cumulative WAL: 4646 writes, 1891 syncs, 2.46 writes per sync, written: 0.01 GB, 0.00 MB/s
> Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
> Interval writes: 163 writes, 637 keys, 163 commit groups, 1.0 writes per commit group, ingest: 0.63 MB, 0.00 MB/s
> Interval WAL: 163 writes, 67 syncs, 2.40 writes per sync, written: 0.00 MB, 0.00 MB/s
> Interval stall: 00:00:0.000 H:M:S, 0.0 percent
> 
> ** Compaction Stats [default] **
> Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>   L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     33.4      0.02              0.00         2    0.009       0      0
>   L1      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.8    162.1    134.6      0.09              0.06         1    0.092    127K    10K
>   L2      9/0  538.64 MB   0.2      0.5     0.0      0.5       0.5      0.0       0.0  43.6    102.7    101.2      5.32              1.31         1    5.325   1496K   110K
>  Sum      9/0  538.64 MB   0.0      0.5     0.0      0.5       0.5      0.0       0.0 961.1    103.3    101.5      5.43              1.37         4    1.358   1623K   121K
>  Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
> 
> ** Compaction Stats [default] **
> Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Low      0/0    0.00 KB   0.0      0.5     0.0      0.5       0.5      0.0       0.0   0.0    103.7    101.7      5.42              1.36         2    2.708   1623K   121K
> High      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     43.9      0.01              0.00         1    0.013       0      0
> User      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.4      0.00              0.00         1    0.004       0      0
> Uptime(secs): 15000.1 total, 600.0 interval
> Flush(GB): cumulative 0.001, interval 0.000
> AddFile(GB): cumulative 0.000, interval 0.000
> AddFile(Total Files): cumulative 0, interval 0
> AddFile(L0 Files): cumulative 0, interval 0
> AddFile(Keys): cumulative 0, interval 0
> Cumulative compaction: 0.54 GB write, 0.04 MB/s write, 0.55 GB read, 0.04 MB/s read, 5.4 seconds
> Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
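The stats really do look harmless. If you want to see whether the spilled
metadata moves back onto the DB device after a compaction, you could trigger
one by hand via the admin socket; a rough sketch, assuming the compact command
is available on your 14.2.8 OSDs:

    ceph daemon osd.3 compact

and then check ceph health detail and the bluefs counters again afterwards.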
> 
> I see the following in a perf dump:
> 
>     "bluefs": {
>         "gift_bytes": 0,
>         "reclaim_bytes": 0,
>         "db_total_bytes": 6442442752,
>         "db_used_bytes": 696246272,
>         "wal_total_bytes": 0,
>         "wal_used_bytes": 0,
>         "slow_total_bytes": 40004222976,
>         "slow_used_bytes": 5242880,
>         "num_files": 20,
>         "log_bytes": 41631744,
>         "log_compactions": 0,
>         "logged_bytes": 40550400,
>         "files_written_wal": 2,
>         "files_written_sst": 41,
>         "bytes_written_wal": 102040973,
>         "bytes_written_sst": 2233090674,
>         "bytes_written_slow": 0,
>         "max_bytes_wal": 0,
>         "max_bytes_db": 1153425408,
>         "max_bytes_slow": 0,
>         "read_random_count": 127832,
>         "read_random_bytes": 2761102524,
>         "read_random_disk_count": 19206,
>         "read_random_disk_bytes": 2330400597,
>         "read_random_buffer_count": 108844,
>         "read_random_buffer_bytes": 430701927,
>         "read_count": 21457,
>         "read_bytes": 1087948189,
>         "read_prefetch_count": 21438,
>         "read_prefetch_bytes": 1086853927
>     },
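To keep an eye on just the relevant counters for each of the three OSDs,
something like this should do (assuming jq is installed on the nodes):

    ceph daemon osd.3 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

slow_used_bytes is exactly the 5 MiB (5242880 bytes) from the health warning,
while the DB itself only uses ~660 MiB of the 6 GiB.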
> 
> 
> > If the above doesn't give any information, then you may need to export
> > the BlueFS (RocksDB). Then you can run the kvstore-tool on it.
> I'll look into trying this, although I'd say it's some kind of bug.
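In case you do give it a try: a rough sketch of the export, with the OSD
stopped (paths and the stats sub-command are from memory, so please double-check
them against the man pages first):

    ceph osd set noout
    systemctl stop ceph-osd@3
    ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-3 --out-dir /root/osd.3-bluefs
    ceph-kvstore-tool rocksdb /root/osd.3-bluefs/db stats
    systemctl start ceph-osd@3
    ceph osd unset noout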
> > 
> > > > The easiest way is to destroy and re-create the OSD with a bigger
> > > > DB/WAL. The guideline from Facebook for RocksDB is 3/30/300 GB.
> > > It's well below the 3 GiB limit in the guideline ;)
> > For now. ;)
> The cluster is 2 years old now and the amount of data is quite stable; I think
> it will hold for some time ;)
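Should you ever need to go that route, it is straightforward from the PVE side;
roughly, one OSD at a time (option names from memory, please check pveceph help
osd create; /dev/sdX and /dev/nvme0n1 stand in for your data and DB devices):

    pveceph osd destroy 3
    pveceph osd create /dev/sdX -db_dev /dev/nvme0n1 -db_size 30

and wait for the cluster to be back to HEALTH_OK before doing the next one.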
Hm... Igor reckons that this seems to be normal.
https://tracker.ceph.com/issues/38745#note-28
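
If the permanent HEALTH_WARN is the main annoyance until this is resolved
upstream, the warning can also be muted; if I remember correctly the option is
(please verify the name on 14.2.8 before setting it):

    ceph config set osd bluestore_warn_on_bluefs_spillover false

That only hides the warning, it does not change the spillover itself.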

--
Cheers,
Alwin



