[PVE-User] Ceph bluestore OSD Journal/DB disk size
Alwin Antreich
a.antreich at proxmox.com
Wed May 29 11:59:48 CEST 2019
Hi Eneko,
On Wed, May 29, 2019 at 10:30:33AM +0200, Eneko Lacunza wrote:
> Hi all,
>
> I have noticed that our office Proxmox cluster has a Bluestore OSD with a
> very small DB partition. This OSD was created from the GUI on 12th March
> this year:
>
> This node has 4 OSDs:
> - osd.12: bluestore, all SSD
> - osd.3: bluestore, SSD db + spinning
> - osd.2: filestore, SSD journal + spinning
> - osd.4: filestore, SSD journal + spinning
>
> We have two pools in the cluster (SSD and HDD).
>
> I see that for osd.3 block.db points to /dev/sdb8, which is 1G in size:
>
> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1,8T 0 disk
> ├─sda1 8:1 0 100M 0 part /var/lib/ceph/osd/ceph-12
> └─sda2 8:2 0 1,8T 0 part
> sdb 8:16 0 931,5G 0 disk
> ├─sdb1 8:17 0 100M 0 part /var/lib/ceph/osd/ceph-3
> └─sdb2 8:18 0 931,4G 0 part
> sdc 8:32 0 931,5G 0 disk
> └─sdc1 8:33 0 931,5G 0 part /var/lib/ceph/osd/ceph-4
> sdd 8:48 0 931,5G 0 disk
> └─sdd1 8:49 0 931,5G 0 part /var/lib/ceph/osd/ceph-2
> sde 8:64 0 186,3G 0 disk
> ├─sde1 8:65 0 1007K 0 part
> ├─sde2 8:66 0 127M 0 part /boot/efi
> ├─sde3 8:67 0 59,9G 0 part
> │ ├─pve-root 253:0 0 18,6G 0 lvm /
> │ └─pve-swap 253:1 0 952M 0 lvm [SWAP]
> ├─sde5 8:69 0 5G 0 part
> ├─sde6 8:70 0 5G 0 part
> └─sde8 8:72 0 1G 0 part
>
> This was created from the GUI. I see that the GUI currently doesn't allow
> specifying the journal/DB partition size... (I can't test the whole process
> up to creation...)
Currently, yes. Ceph Nautilus (coming with PVE6) has many changes in
store.
>
> I think 1GB may be too small for a default value, and that it could be
> preventing the full DB from being placed in that partition, as per these
> ceph-users mailing list messages:
This is Ceph's default setting; it can be changed by adding
bluestore_block_db_size and bluestore_block_wal_size values to ceph.conf.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/#post-210638
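For example, a minimal ceph.conf sketch (the sizes below are purely
illustrative, not a recommendation, and only affect OSDs created after the
change):

    [osd]
    bluestore_block_db_size  = 32212254720   # 30 GiB, value is in bytes (example only)
    bluestore_block_wal_size = 2147483648    # 2 GiB, value is in bytes (example only)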
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030740.html
> https://www.spinics.net/lists/ceph-devel/msg39315.html
>
> Maybe 3GB would be a better default? Also, it seems that for OSD nodes that
> aren't very dense, 30GB (or whatever the next level is) would be feasible too.
>
> I see the following in a perf dump of osd.3
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 1073733632,
> "db_used_bytes": 664797184,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 40004222976,
> "slow_used_bytes": 1228931072,
> "num_files": 19,
> "log_bytes": 1318912,
> "log_compactions": 1,
> "logged_bytes": 164077568,
> "files_written_wal": 2,
> "files_written_sst": 17,
> "bytes_written_wal": 1599916960,
> "bytes_written_sst": 752941742
> },
>
> So, 665 MB of the DB partition is used, and there is 1.2 GB of additional
> data on slow storage...
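As a side note, those bluefs counters can be read live from the OSD's admin
socket; a minimal sketch, assuming the admin socket is in its default location
and jq is installed:

    ceph daemon osd.3 perf dump | jq '.bluefs'

A slow_used_bytes value above zero is the tell-tale sign that part of the DB
has already spilled over onto the spinning disk.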
The DB size will also vary with the workload (e.g. RBD, CephFS, EC...), so a
single default size may not always work. But I have read (sadly I can't find
the link) that with Nautilus it should be possible to expand and migrate the
DB/WAL of offline OSDs, which makes room for such optimizations.
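If that lands, it could look roughly like the following sketch (assumptions:
the OSD is stopped, the Nautilus ceph-bluestore-tool subcommands behave as
documented, and /dev/sdb9 is just a placeholder for a new, larger partition):

    # after growing the underlying DB partition, let BlueFS use the new space:
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3

    # or move the DB to a different (bigger) device:
    ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-3 \
        --devs-source /var/lib/ceph/osd/ceph-3/block.db --dev-target /dev/sdb9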
The general recommendation from Ceph is that block.db should not be smaller
than 4% of the data device (e.g. 1 TB block -> block.db >= 40 GB).
http://docs.ceph.com/docs/luminous/rados/configuration/bluestore-config-ref/#sizing
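For the 931.5 GB spinner behind osd.3 in the lsblk output above, that rule of
thumb works out to roughly

    931.5 GB * 0.04 ≈ 37 GB of block.db

so the current 1 GB partition is far below it.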
--
Cheers,
Alwin