[PVE-User] Ceph bluestore OSD Journal/DB disk size

Eneko Lacunza elacunza at binovo.es
Wed May 29 13:37:18 CEST 2019


Hi Alwin,

On 29/5/19 at 11:59, Alwin Antreich wrote:
>> I have noticed that our office Proxmox cluster has a Bluestore OSD with a
>> very small db partition. This OSD was created from the GUI on 12th March this
>> year:
>>
>> This node has 4 OSDs:
>> - osd.12: bluestore, all SSD
>> - osd.3: bluestore, SSD db + spinning
>> - osd.2: filestore, SSD journal + spinning
>> - osd.4: filestore, SSD journal + spinning
>>
>> We have two pools in the cluster (SSD and HDD).
>>
>> I see that for osd.3 block.db points to /dev/sdb8, which is 1G in size:
>>
>> # lsblk
>> NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>> sda            8:0    0   1,8T  0 disk
>> ├─sda1         8:1    0   100M  0 part /var/lib/ceph/osd/ceph-12
>> └─sda2         8:2    0   1,8T  0 part
>> sdb            8:16   0 931,5G  0 disk
>> ├─sdb1         8:17   0   100M  0 part /var/lib/ceph/osd/ceph-3
>> └─sdb2         8:18   0 931,4G  0 part
>> sdc            8:32   0 931,5G  0 disk
>> └─sdc1         8:33   0 931,5G  0 part /var/lib/ceph/osd/ceph-4
>> sdd            8:48   0 931,5G  0 disk
>> └─sdd1         8:49   0 931,5G  0 part /var/lib/ceph/osd/ceph-2
>> sde            8:64   0 186,3G  0 disk
>> ├─sde1         8:65   0  1007K  0 part
>> ├─sde2         8:66   0   127M  0 part /boot/efi
>> ├─sde3         8:67   0  59,9G  0 part
>> │ ├─pve-root 253:0    0  18,6G  0 lvm  /
>> │ └─pve-swap 253:1    0   952M  0 lvm  [SWAP]
>> ├─sde5         8:69   0     5G  0 part
>> ├─sde6         8:70   0     5G  0 part
>> └─sde8         8:72   0     1G  0 part
>>
>> This was created from the GUI. I see that currently the GUI doesn't allow
>> specifying the journal/DB partition size... (I can't test the whole process
>> up to creation...)
> Currently, yes. Ceph Nautilus (coming with PVE6) has many changes in
> store.
This is nice to know! :)

>
>> I think 1GB may be too small as a default value, and that it could be
>> preventing the full DB from being placed in that partition, as per the
>> ceph-users mailing list messages:
> This is Ceph's default setting and can be changed by adding the
> bluestore_block_db_size and bluestore_block_wal_size values to
> ceph.conf.
> https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/#post-210638
Thanks for the explanation. I don't really know why Ceph's default is 1GB,
but it seems it takes 1% of the block device, as per:
https://ceph.com/community/new-luminous-bluestore/

But if you read the links I sent, it seems that only DB sizes of
3GB/30GB/300GB are meaningful, because RocksDB doesn't take advantage of
the available space at other sizes (it effectively limits itself to one
of those levels). So I think it would be useful if Proxmox used a 3GB
default size: no problems with migrations from filestore with journal
(5GB was the default for that), and also not a problem for OSD-dense
servers...
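
If I understood your forum link correctly, something like the following
in ceph.conf (values in bytes; the 3GB figure is just my suggestion
above, and I haven't tested this myself) should make newly created
Bluestore OSDs use a 3GB DB partition:

    [osd]
    # 3 GiB expressed in bytes; bluestore_block_wal_size can be set analogously
    bluestore_block_db_size = 3221225472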
>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030740.html
>> https://www.spinics.net/lists/ceph-devel/msg39315.html
>>
>> Maybe 3GB would be a better default? Also it seems that for nodes that are
>> not very OSD-dense, 30GB (or whatever the next level is) would be feasible too.
>>
>> I see the following in a perf dump of osd.3
>>      "bluefs": {
>>          "gift_bytes": 0,
>>          "reclaim_bytes": 0,
>>          "db_total_bytes": 1073733632,
>>          "db_used_bytes": 664797184,
>>          "wal_total_bytes": 0,
>>          "wal_used_bytes": 0,
>>          "slow_total_bytes": 40004222976,
>>          "slow_used_bytes": 1228931072,
>>          "num_files": 19,
>>          "log_bytes": 1318912,
>>          "log_compactions": 1,
>>          "logged_bytes": 164077568,
>>          "files_written_wal": 2,
>>          "files_written_sst": 17,
>>          "bytes_written_wal": 1599916960,
>>          "bytes_written_sst": 752941742
>>      },
>>
>> So, 665MB of the DB partition is used, and there is 1.2GB of additional data
>> on slow storage...
> The DB size will also vary with the workload (eg. RBD, CephFS, EC...), and a
> default size might just not always work. But I have read (sadly I can't
> find the link) that with Nautilus it should be possible to expand and
> migrate the DB/WAL for offline OSDs, making room for such optimizations.
>
> The general statement from Ceph is that the block.db should be bigger
> than 4% of the data device (eg. 1TB data => 40G DB).
> http://docs.ceph.com/docs/luminous/rados/configuration/bluestore-config-ref/#sizing
But the recent messages on the ceph-users list seem to indicate that the
documentation is wrong in this regard...
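
In case it helps anyone checking their own OSDs: the perf dump figures
above come from the OSD admin socket, so something like this on the OSD
node (adjusting the OSD id) should show whether the DB is spilling over
onto the slow device:

    # ceph daemon osd.3 perf dump | grep -E '(db|slow)_(total|used)_bytes'

A non-zero slow_used_bytes (1228931072 in my case, ~1.2GB) means part of
the DB/metadata is living on the spinning disk instead of the SSD
partition.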

Cheers
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



