[PVE-User] (Very) basic question regarding PVE Ceph integration

Mon Dec 17 13:20:50 CET 2018

On 12/17/18 9:23 AM, Eneko Lacunza wrote:
> Hi,
> 
> On 16/12/18 at 17:16, Frank Thommen wrote:
>>>> I understand that with the new PVE release PVE hosts (hypervisors) can be
>>>> used as Ceph servers.  But it's not clear to me if (or when) that makes
>>>> sense.  Do I really want to have Ceph MDS/OSD on the same hardware as my
>>>> hypervisors?  Doesn't that a) accumulate multiple POFs on the same hardware
>>>> and b) occupy computing resources (CPU, RAM), that I'd rather use for my VMs
>>>> and containers?  Wouldn't I rather want to have a separate Ceph cluster?
>>> The integration of Ceph services in PVE started with Proxmox VE 3.0.
>>> With PVE 5.3 (current) we added CephFS services to the PVE. So you can
>>> run a hyper-converged Ceph with RBD/CephFS on the same servers as your
>>> VM/CT.
>>>
>>> a) can you please be more specific in what you see as multiple points of
>>> failure?
>>
>> Not only do I run the hypervisor which controls containers and virtual 
>> machines on the server, but also the file service which is used to 
>> store the VM and container images.
> I think you have fewer points of failure :-) because you'll have 3 points 
> (nodes) of failure in a hyperconverged scenario and 6 in a separate 
> virtualization/storage cluster scenario...  it depends on how you look at it.

Right, but I look at it from the service side: one hardware failure -> 
one service affected vs. one hardware failure -> two services affected.

>>> b) depends on the workload of your nodes. Modern server hardware has
>>> enough power to be able to run multiple services. It all comes down to
>>> having enough resources for each domain (e.g. Ceph, KVM, CT, host).
>>>
>>> I recommend starting with a simple calculation, just to get a
>>> direction.
>>>
>>> In principle:
>>>
>>> ==CPU==
>>> core='CPU with HT on'
>>>
>>> * reserve a core for each Ceph daemon
>>>    (preferably on the same NUMA node as the network card; higher
>>>    frequency is better)
>>> * one core for the network card (higher frequency = lower latency)
>>> * rest of the cores for OS (incl. monitoring, backup, ...), KVM/CT usage
>>> * don't overcommit
>>>
>>> ==Memory==
>>> * 1 GB per TB of used disk space on an OSD (more on recovery)
> Note this is not true anymore with Bluestore, because you have to take 
> cache space into account (1GB for HDD and 3GB for SSD OSDs if I recall 
> correctly), and also currently OSD processes aren't that good with RAM 
> use accounting... :)
>>> * enough memory for KVM/CT
>>> * free memory for OS, backup, monitoring, live migration
>>> * don't overcommit
>>>
>>> ==Disk==
>>> * one OSD daemon per disk, even disk sizes throughout the cluster
>>> * more disks, more hosts, better distribution
>>>
>>> ==Network==
>>> * at least 10 GbE for storage traffic (the more the better),
>>>    see our benchmark paper
>>> https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/ 
>>>
> 10Gbit helps a lot with latency; small clusters can work perfectly with 
> 2x1Gbit if they aren't latency-sensitive (we have been running a 
> handful of those for some years now).
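
The CPU and memory rules of thumb quoted above can be turned into a rough
back-of-the-envelope calculation. The Python sketch below is only an
illustration and was not posted in the thread: the names NodePlan,
reserved_cores, ceph_ram_gb and guest_budget are hypothetical, and the
per-OSD cache allowance is an input you have to choose yourself (the 1GB/3GB
figures above were given from memory).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NodePlan:
    """Rough resource plan for one hyper-converged node (hypothetical helper)."""
    total_cores: int           # cores counted as 'CPU with HT on'
    total_ram_gb: float
    ceph_daemons: int          # OSD + MON/MGR/MDS daemons on this node
    osd_used_tb: List[float]   # used disk space per OSD, in TB
    cache_gb_per_osd: float = 0.0  # assumption: extra Bluestore cache per OSD


def reserved_cores(plan: NodePlan) -> int:
    # One core per Ceph daemon plus one core for the network card.
    return plan.ceph_daemons + 1


def ceph_ram_gb(plan: NodePlan) -> float:
    # 1 GB per TB of used disk space on each OSD, plus the chosen cache allowance.
    return sum(plan.osd_used_tb) + plan.cache_gb_per_osd * len(plan.osd_used_tb)


def guest_budget(plan: NodePlan) -> Tuple[int, float]:
    """Cores and RAM left over for OS, backup, monitoring and KVM/CT."""
    cores = plan.total_cores - reserved_cores(plan)
    ram = plan.total_ram_gb - ceph_ram_gb(plan)
    if cores <= 0 or ram <= 0:
        raise ValueError("node is overcommitted by Ceph alone")
    return cores, ram


# Example node: 16 threads, 64 GB RAM, 4 OSDs + 1 monitor, 2 TB used per OSD.
plan = NodePlan(total_cores=16, total_ram_gb=64, ceph_daemons=5,
                osd_used_tb=[2, 2, 2, 2], cache_gb_per_osd=1.0)
print(guest_budget(plan))
```

The remainder still has to cover the OS, backup, monitoring and live
migration before anything is handed to VMs and containers, and recovery
needs more memory than steady state, so treat the result as an upper bound.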

I will keep the two points in mind.  Thank you.
frank