[PVE-User] Ceph or Gluster

Sat Apr 23 02:36:41 CEST 2016

On 23/04/2016 7:50 AM, Brian :: wrote:
> With NVME journals on a 3 node 4 OSD cluster

Well, your hardware is rather better than mine :) I'm just using
consumer-grade SSDs for journals, which won't have anywhere near the
performance of NVMe.

> if I do a quick dd of a
> 1GB file on a VM I can see 2.34Gbps on the storage network straight
> away so if I was only using 1Gbps here the network would be a
> bottleneck. If I perform the same in 2 VMs traffic hits 4.19Gbps on
> the storage network.
>
> The throughput in the VM is 1073741824 bytes (1.1 GB) copied, 3.43556
> s, 313 MB/s (R=3)

dd isn't really a good test of throughput; it's too easy for the kernel
and filesystem to optimise it. bonnie++ or even CrystalDiskMark (Windows
VM) would be interesting.
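
To illustrate why a plain dd figure can flatter the storage, here is a
minimal sketch (not a replacement for bonnie++ or a proper benchmark)
that times the same write twice: once stopping the clock as soon as the
data has been handed to the kernel, and once after an fsync. The
function names and the 64MB test size are my own choices for
illustration. Note it writes zeros, which a compressing filesystem may
also shrink to almost nothing.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size=1024 * 1024, sync=False):
    """Write total_bytes of zeros to path and return throughput in MB/s.

    With sync=False the data may still be sitting in the page cache when
    the timer stops. With sync=True the timer also covers os.fsync(), so
    the figure includes pushing the data down to the underlying storage.
    """
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: total_bytes - written]
            f.write(chunk)
            written += len(chunk)
        if sync:
            f.flush()
            os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


def compare(total_bytes=64 * 1024 * 1024, directory=None):
    """Return (buffered_mb_s, synced_mb_s), using one temporary file each."""
    results = []
    for sync in (False, True):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            results.append(timed_write(path, total_bytes, sync=sync))
        finally:
            os.remove(path)
    return tuple(results)


if __name__ == "__main__":
    buffered, synced = compare()
    print(f"buffered: {buffered:.0f} MB/s, fsync: {synced:.0f} MB/s")
```

Run inside a VM, the gap between the two numbers gives a rough idea of
how much of a dd result is cache rather than storage.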

>
> Would be very interested in hearing more about your gluster setup.. I
> don't know anything about it - how many nodes are involved?

POOMA U summary:

Red Hat offers two cluster filesystems, ceph and gluster - gluster
actually predates ceph, though ceph definitely gets more attention now.

     http://gluster.org

gluster replicates a file system directly, whereas ceph rbd is a pure 
block based replication system (ignoring rgw etc). CephFS only reached 
stable in the latest release, but rbd is a good match for block based VM 
images. Like ceph, gluster has a direct block based interface for VM 
images (gfapi) integrated with qemu which offers better performance than 
fuse based filesystems.

One of the problems with gluster used to be its file based replication
and healing process - it had no way of tracking block changes, so when a
node was down and a large VM image was written to, it would have to scan
and compare the entire multi GB file for changes when the node came back
up. A non-issue for ceph, where block devices are stored in 4MB chunks
and it tracks which chunks have changed.

However, in version 3.7 gluster introduced sharded volumes, where files
are stored in shards. Shard size is configurable and defaults to 4MB.
That has brought gluster heal performance and resource usage into the
same league as ceph, though ceph is still slightly faster I think.
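
As a rough illustration of why chunking matters for healing, here is a
simplified model (my own sketch, not how gluster or ceph actually track
changes internally): given the writes that happened while a node was
down, it works out which fixed-size shards were touched and how much
data a heal would have to look at, compared with scanning the whole
image. The function names and the example write pattern are made up for
illustration.

```python
def dirty_shards(writes, shard_size=4 * 1024 * 1024):
    """Return the sorted indices of shards touched by (offset, length) writes."""
    touched = set()
    for offset, length in writes:
        if length <= 0:
            continue
        first = offset // shard_size
        last = (offset + length - 1) // shard_size
        touched.update(range(first, last + 1))
    return sorted(touched)


def heal_bytes(image_size, writes, shard_size=None):
    """Bytes that have to be examined to heal one replica in this model.

    shard_size=None models the unsharded case: any write at all means
    the whole image has to be scanned and compared.
    """
    if shard_size is None:
        return image_size if any(length > 0 for _, length in writes) else 0
    total = 0
    for index in dirty_shards(writes, shard_size):
        start = index * shard_size
        # The final shard of an image may be shorter than shard_size.
        total += min(shard_size, image_size - start)
    return total


if __name__ == "__main__":
    MiB = 1024 * 1024
    GiB = 1024 * MiB
    image = 100 * GiB
    # Hypothetical writes made while one node was offline.
    writes = [(0, 4096), (4 * MiB - 1, 2), (10 * GiB, 8 * MiB)]
    print("shards to heal:", dirty_shards(writes))
    print("sharded heal bytes:", heal_bytes(image, writes, 4 * MiB))
    print("unsharded heal bytes:", heal_bytes(image, writes))
```

In this example the three writes touch four 4MB shards, so the sharded
heal looks at 16MB where the unsharded one has to go through all 100GB.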

One huge problem I've noticed with ceph is snapshot speed. For me, via
proxmox, ceph rbd live snapshots were unusably slow: sluggish to take,
and rolling back a snapshot would take literally hours. Same problem
with restoring backups. Deal breaker for me. Gluster can use qcow2
images, and snapshot rollbacks would take a couple of minutes at worst.

My hardware setup:

3 Proxmox nodes, VMs and ceph/gluster on all 3.

Node 1:
     - Xeon E5-2620
     - 64GB RAM
     - ZFS RAID10
        - SSD log & cache
        - 4 * 3TB WD Red
     - 3 * 1Gb Eth

Node 2:
     - 2 * Xeon E5-2660
     - 64GB RAM
     - ZFS RAID10
        - SSD log & cache
        - 4 * 3TB WD Red
     - 3 * 1Gb Eth

Node 3:
     - Xeon E5-2620
     - 64GB RAM
     - ZFS RAID10
        - SSD log & cache
        - 6 * 600GB VelociRaptor
        - 2 * 3TB WD Red
     - 2 * 1Gb Eth

Originally ceph had all the disks to itself (xfs underneath); now ceph
and gluster are both running off ZFS pools while I evaluate gluster.
Currently half the VMs are running off gluster. Not ideal, as there is a
certain amount of overhead in running both.

gluster - basically the same overall setup as ceph:
- replica 3
- 64MB shard size
- caching etc is all handled by ZFS
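
For anyone wanting to try something similar, here is a sketch that only
builds (and does not run) the gluster CLI commands for a replica 3
volume with 64MB shards. The volume name, hostnames and brick paths are
placeholders, not my actual ones, and the exact steps may differ between
gluster versions, so check them against the docs before use.

```python
def gluster_setup_commands(volume, bricks, replica=3, shard_size="64MB"):
    """Return the gluster CLI invocations as argv lists, without running them."""
    if not bricks or len(bricks) % replica != 0:
        raise ValueError("number of bricks must be a multiple of the replica count")
    return [
        # Create a replicated volume across the given bricks.
        ["gluster", "volume", "create", volume, "replica", str(replica), *bricks],
        # Turn on sharding and set the shard size.
        ["gluster", "volume", "set", volume, "features.shard", "on"],
        ["gluster", "volume", "set", volume, "features.shard-block-size", shard_size],
        ["gluster", "volume", "start", volume],
    ]


if __name__ == "__main__":
    # Placeholder hostnames and brick paths, one brick per node.
    bricks = [f"node{n}:/tank/gluster/brick" for n in (1, 2, 3)]
    for argv in gluster_setup_commands("datastore", bricks):
        print(" ".join(argv))
```

Sharding should be enabled before any VM images are written to the
volume, since it only applies to files created after it is turned on.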

Crucial things for me:
- stability. Does it crash a lot :)
- robustness. How well does it cope with node crashes, network outages etc
- performance - raw speed and IOPS
- snapshots. How easy is it to snapshot and roll back VMs. Not an issue
for everyone, but we run a lot of dev and testing VMs where easy access
to multiple snapshots is important.
- backups. How easy to back up and *restore*.

cheers,

-- 
Lindsay Mathieson