[PVE-User] Ceph + ZFS?

Lindsay Mathieson lindsay.mathieson at gmail.com
Wed Jun 1 23:46:50 CEST 2016

On 2/06/2016 6:10 AM, Jeremy McCoy wrote:
> I am new here and am working on designing a Proxmox cluster. Just wondering if
> anyone has tried doing Ceph on ZFS (instead of XFS on LVM or whatever pveceph
> sets up) and, if so, how you went about implementing it. I have 4 hosts that
> each have 1 spinner and 1 SSD to offer to the cluster.

Been there ... ceph does not like zfs; apparently COW filesystems perform 
badly under ceph's workload. It's recommended against (as is ext4, btw).

ceph does not do well on small setups; with only one osd/disk per node 
you will get *terrible* performance.

The 9.x versions of ceph have a latency bug that triggers under memory 
pressure (common on combined compute/storage nodes), and 10.x IMO is even 
less friendly to small setups. Also, as you've probably noticed, it's a 
maintenance headache for small shops, especially when things go wrong.

> Are there any pitfalls to be aware of here? My goal is to mainly run LXC
> containers (plus a few KVM VMs) on distributed storage, and I was hoping to take
> advantage of ZFS's caching, compression, and data integrity features. I am also
> open to doing GlusterFS or something else, but it looked like Proxmox does not
> support LXC containers running on that yet.

Probably because lxc doesn't support the native gluster api (gfapi); I 
imagine the same problem exists with ceph/rbd.

However, there is also the gluster fuse mount that proxmox automatically 
creates (/mnt/pve/<glusterid>). You should be able to set that up as 
shared directory storage and use it with lxc. I'll have a test of that 
myself later today.
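Something like the following might work for registering that mount as directory storage -- this is a sketch, not tested; the storage ID "gluster-dir" and the content types are my assumptions, and <glusterid> stays whatever your gluster storage is actually called:

```shell
# Hypothetical: expose the proxmox-managed gluster fuse mountpoint as
# shared directory storage usable by lxc containers.
# "gluster-dir" is a made-up storage ID; <glusterid> is your real one.
pvesm add dir gluster-dir \
    --path /mnt/pve/<glusterid> \
    --shared 1 \
    --content rootdir,images
```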

gluster works pretty well with zfs; I get excellent performance, maxing 
out my network for writes and much better iops than I was getting with 
ceph. Enabling lz4 compression gave me a 33% saving on space with no 
noticeable impact on performance.
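For reference, enabling lz4 is a one-liner -- assuming here the gluster brick sits on a dataset called tank/gluster (name is hypothetical):

```shell
# Enable lz4 compression on the dataset backing the gluster brick.
# New writes are compressed; existing data is not rewritten.
zfs set compression=lz4 tank/gluster

# Check the achieved ratio (a ~1.5x ratio corresponds to ~33% saved).
zfs get compressratio tank/gluster
```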

If you can afford it, I recommend ZFS RAID1 or, better yet, ZFS RAID10 
per node. Apart from the extra redundancy, it has much better read/write 
performance than a single disk. And it's much easier to replace a failed 
disk in a zfs mirror than it is to replace a failed gluster or ceph node.
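To illustrate that last point, here's a sketch of a per-node RAID10 pool and a disk swap -- the pool name "tank" and the device names are examples only; in practice you'd use stable /dev/disk/by-id paths:

```shell
# Hypothetical ZFS RAID10 layout: two mirrored pairs striped together.
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd

# Swapping a failed disk is one command; zfs resilvers just that mirror,
# versus re-replicating a whole node's data in gluster or ceph.
zpool replace tank /dev/sda /dev/sde
```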

Lindsay Mathieson
