[PVE-User] Ceph: Some trouble creating OSD with journal on a software raid device...

Alwin Antreich sysadmin-pve at cognitec.com
Thu Oct 13 14:07:00 CEST 2016

Hi Marco,

On 10/13/2016 12:56 PM, Marco Gaiarin wrote:
> Mandi! Alwin Antreich
>   In chel di` si favelave...
>> I have to ask a more general question here, why are you putting the journal on a RAID1?
> For safety?
>> For better performance and less
>> complexity the journal should reside on standalone SSDs. With the RAID1 you limit the speed of the journal, then it
>> would be better to reside the journal on the OSD disks itself.
> I know that. But i'm setting up a little ceph cluster, using as network
> backend only gigabit ethernet, so my bottleneck is mostly the 50GB/s of
> the network.
> Also, i cannot efford to buy a SSD for every OSD, and using the same SSD
> for many/all the OSD in the box is a big SPoF.

Using an SSD for OSD journals is not a SPOF. Of course, when the SSD fails, the OSDs attached to it will go down, but
the cluster will automatically recover the data that is no longer redundant onto the remaining OSDs in the cluster.
That is the same as if the whole machine died; everything would need to recover the same way.

The ratio of journal SSD to OSD disks can be worked out with a simple dd test. Take the write speed of the SSD and
divide it by the write speed of the OSD disk; the result is the maximum number of OSDs that can share one SSD without
the SSD becoming the bottleneck.

e.g. 500 MB/s write speed for the SSD, divided by 100 MB/s OSD write speed = 5 OSDs per journal SSD
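A rough sketch of such a dd test; the file path is illustrative, and the 500/100 figures are just the example numbers above, so substitute your own measured speeds:

```shell
# Rough write-speed check with dd (conv=fdatasync forces the data to disk
# before dd reports a speed). /tmp/dd-test is an illustrative path; run it
# on the filesystem backed by the device you want to measure.
dd if=/dev/zero of=/tmp/dd-test bs=1M count=256 conv=fdatasync
rm -f /tmp/dd-test

# With the example numbers above: 500 MB/s SSD / 100 MB/s OSD
# = max OSDs per journal SSD
echo $((500 / 100))
```

Run the test once on the SSD and once on an OSD disk, then divide the two reported speeds.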

> So I'm using (software) raid1, confident enough that the penalty of
> the raid won't impact things much overall.

As noted above, I wouldn't recommend adding another layer in between; it makes things harder to troubleshoot.

>>> The proxmox correctly see the 4 OSD candidate disks, but does not see the
>>> journal partition. So i've used commandline:
>> pveceph is a wrapper around the ceph tools, and one of its dependencies is smartmontools. mdadm devices don't expose
>> SMART attributes, which might be why it's not seeing the journal. But this is more a guess and should be verified by
>> someone who knows better.
> I suppose that.
> I'm a bit unconfident with the error/warning message printed and the
> general behaviour.
> As far as I've understood, ceph can use disks, partitions, and even
> files for the journal.
> Probably 'md' devices are neither partitions nor disks, and ceph gets
> confused.
> Would be better, for example, to simply put journal on files? Eg,
> format the md device, mount it and create inside the journal files?

You can use files as journal disks, but if I recall correctly, there was a thread on the ceph mailing list discussing
this and its limitations.
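For what it's worth, a file-based journal is configured in ceph.conf; a minimal sketch, assuming the md device is formatted and mounted at a hypothetical mount point:

```ini
[osd]
# Hypothetical mount point for the formatted md device; $id expands
# to the OSD id, so each OSD gets its own journal file.
osd journal = /mnt/md-journal/osd.$id/journal
osd journal size = 5120  # MB
```

The paths are assumptions for illustration; only `osd journal` and `osd journal size` are the actual ceph options.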

> Thanks.

