[PVE-User] How to configure the best for CEPH

Eneko Lacunza elacunza at binovo.es
Thu Mar 17 09:50:26 CET 2016


Hi Jean-Laurent,

On 16/03/16 at 20:39, Jean-Laurent Ivars wrote:
> I have a 2-host cluster set up with ZFS, replicated to each other 
> with a pvesync script among other things, and my VMs are running on these 
> hosts for now, but I am impatient to migrate to my new 
> infrastructure. I decided to change my infrastructure because I really 
> want to take advantage of CEPH for replication, expansion, 
> live migration and maybe even a high-availability setup.
>
> After having read a lot of documentation/books/forums, I decided to 
> go with CEPH storage, which seems to be the way to go for me.
>
> My servers are hosted by OVH and from what I read, and with the budget 
> I have, the best options with CEPH storage in mind seemed to be the 
> following servers : 
> https://www.ovh.com/fr/serveurs_dedies/details-servers.xml?range=HOST&id=2016-HOST-32H 
>
> With the following storage options: no HW RAID, 2x300GB SSD and 2x2TB HDD
About the SSDs, what exact brand/model are they? I can't find this info 
on the OVH web site.
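
If you already have shell access on the nodes, something like this should 
show the model (just a sketch, assuming smartmontools is installed and that 
/dev/sda and /dev/sdb are the SSDs; adjust the device names to your layout):

  smartctl -i /dev/sda | grep -i model
  smartctl -i /dev/sdb | grep -i model
  lsblk -d -o NAME,MODEL,SIZE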
>
> One of the reasons I chose these models is the 10Gb VRACK option, and 
> I understood that CEPH needs a fast network to be efficient. Of course, 
> in a perfect world, the best would be to have a lot of disks for OSDs, 
> two more SSDs for my system and two bonded 10Gb NICs, but this is the 
> closest I can afford in the OVH product range.
In your configuration, I doubt very much you'll be able to take advantage 
of the 10Gb NICs; I have a 3-node setup with 3 OSDs each in our office, on 
a 1 Gbit network, and Ceph barely uses 200-300 Mbps. You may get slightly 
lower latency, but that will be all.
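
Once the cluster is up you can get a rough idea of what Ceph actually pushes 
over the wire with a short benchmark; a minimal sketch, assuming a test pool 
named 'test' that you can afford to write benchmark objects into:

  rados bench -p test 60 write --no-cleanup
  rados bench -p test 60 seq
  rados -p test cleanup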
>
> I already made the install of the cluster and set different VLANs for 
> cluster and storage, set the hosts files and installed CEPH. 
> Everything went seamlessly except that the OVH installation creates 
> an MBR install on the SSDs and CEPH needs GPT, but I managed to 
> convert the partition tables, so I thought I was all set for CEPH 
> configuration.
>
> For now, my partitioning scheme is the following (the message was 
> rejected because it was too big for the mailing list, so here is a link): 
> https://www.ipgenius.fr/tools/pveceph.png

Seems quite good; maybe leave a bit more room for the root filesystem, 
you have 300GB of disk... :) Also see below.
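
To double-check the current layout and confirm the partition tables really 
are GPT now, something like this is handy (again assuming /dev/sda and 
/dev/sdb are the SSDs):

  lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
  sgdisk -p /dev/sda
  sgdisk -p /dev/sdb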

>
> I know that it would be better to give CEPH the whole disks, but I have 
> to put my system somewhere… I was thinking that even if it’s not the 
> best (I can’t afford more), these settings would work… So I tried 
> to give CEPH the OSDs with my SSD journal partition with the 
> appropriate command, but it didn’t seem to work, and I assume it's 
> because CEPH doesn’t want partitions but entire hard drives…
>
> root at pvegra1 ~ # pveceph createosd /dev/sdc -journal_dev /dev/sda4
> create OSD on /dev/sdc (xfs)
> using device '/dev/sda4' for journal
> Creating new GPT entries.
> GPT data structures destroyed! You may now partition the disk using 
> fdisk or
> other utilities.
> Creating new GPT entries.
> The operation has completed successfully.
> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the 
> same device as the osd data
> WARNING:ceph-disk:Journal /dev/sda4 was not prepared with ceph-disk. 
> Symlinking directly.
> Setting name!
> partNum is 0
> REALLY setting name!
> The operation has completed successfully.
> meta-data=/dev/sdc1              isize=2048   agcount=4, agsize=122094597 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=488378385, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal log           bsize=4096   blocks=238466, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> Warning: The kernel is still using the old partition table.
> The new table will be used at the next reboot.
> The operation has completed successfully.
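
As an aside, the "kernel is still using the old partition table" warning 
above usually just means the kernel has not reread the new table yet; a 
reboot, or something like the following, normally clears it:

  partprobe /dev/sdc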
>
> I saw the following threads :
> https://forum.proxmox.com/threads/ceph-server-feedback.17909/
> https://forum.proxmox.com/threads/ceph-server-why-block-devices-and-not-partitions.17863/
>
> But this kind of setup seems to suffer from performance issues and it’s 
> not officially supported, and I am not comfortable with that: at the 
> moment I only have a community subscription from Proxmox, but I want 
> to be able to move to a different plan to get support from them if I 
> need it, and if I go this way I’m afraid they will tell me it’s an 
> unsupported configuration.
>
> OVH can provide USB keys, so I could install the system on one and keep 
> my whole disks for CEPH, but I think that is not supported either. 
> Moreover, I fear for performance and long-term stability with this 
> solution.
>
> Maybe I could use one SSD for the system and journal partitions (but 
> again, that’s a mix that is not really supported) and dedicate the other 
> SSD to CEPH… but with this solution I lose my system RAID protection… 
> and a lot of SSD space...
>
> I’m a little confused about the best partitioning scheme and how to 
> obtain a stable, supported and performant configuration with as little 
> wasted space as possible.
>
> Should I continue with my partitioning scheme even if it’s not the 
> best-supported option (it seems the most appropriate in my case), or do 
> I need to completely rethink my install?
>
> Please can someone give me advice, I’m all yours :)
> Thanks a lot to anyone taking the time to read this mail and give me 
> good advice.
I suggest you only mirror the swap and root partitions, then use one SSD 
for each OSD's journal.

So to fix your problems, please try the following (see the sketch after 
this list):
- Remove all OSDs from the Proxmox GUI (or CLI)
- Remove the journal partitions
- Remove the journal partition mirrors
- Now you have 2 partitions on each SSD (swap and root), mirrored.
- Create the OSDs from the Proxmox GUI, using a different SSD for each 
OSD's journal. If the SSDs aren't offered as journal devices, they 
probably still don't have a GPT partition table.
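
A rough CLI sketch of those steps, assuming the OVH install uses mdadm 
software RAID, that the journal mirror is /dev/md4 built from /dev/sda4 and 
/dev/sdb4, and that the HDDs are /dev/sdc and /dev/sdd. All of these names 
are guesses, so verify your own layout (lsblk, /proc/mdstat) first; the 
mdadm/sgdisk steps destroy data on those partitions:

  # remove the existing OSDs (one line per OSD id)
  pveceph destroyosd 0

  # stop the journal mirror and wipe its members
  # (md/partition numbers are guesses; check /proc/mdstat first)
  mdadm --stop /dev/md4
  mdadm --zero-superblock /dev/sda4 /dev/sdb4

  # delete the old journal partitions and let the kernel reread the tables
  sgdisk --delete=4 /dev/sda
  sgdisk --delete=4 /dev/sdb
  partprobe /dev/sda
  partprobe /dev/sdb

  # recreate the OSDs with one SSD as journal device per OSD;
  # ceph-disk carves a journal partition out of the SSD's free space
  pveceph createosd /dev/sdc -journal_dev /dev/sda
  pveceph createosd /dev/sdd -journal_dev /dev/sdb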
>
> P.S. If someone from the official Proxmox support team sees this 
> message, can you tell me whether, if I buy a subscription with tickets, 
> I can get assistance with this kind of question? And if I buy a 
> subscription, I will also ask for help configuring CEPH for the best: 
> SSD pool, normal-speed pool, how to set redundancy, how to make 
> snapshots, how to make backups and so on… is that the kind of thing 
> you can help me with?
You need to first buy a subscription.

Good luck
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
       943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
