[PVE-User] Poor CEPH performance? or normal?
Ronny Aasen
ronny+pve-user at aasen.cx
Wed Jul 25 20:20:34 CEST 2018
On 25. juli 2018 02:19, Mark Adams wrote:
> Hi All,
>
> I have a proxmox 5.1 + ceph cluster of 3 nodes, each with 12 x WD 10TB GOLD
> drives. Network is 10Gbps on X550-T2, separate network for the ceph cluster.
>
> I have 1 VM currently running on this cluster, which is debian stretch with
> a zpool on it. I'm zfs sending in to it, but only getting around ~15MiB/s
> write speed. does this sound right? it seems very slow to me.
>
> Not only that, but when this zfs send is running - I can not do any
> parallel sends to any other zfs datasets inside of the same VM. They just
> seem to hang, then eventually say "dataset is busy".
>
> Any pointers or insights greatly appreciated!
Greetings

Alwin gave you some good advice about filesystems and VMs; I wanted to
say a little about Ceph.
With 3 nodes and the default and recommended size=3 pools, you cannot
tolerate any node failures. IOW, if you lose a node, or need to do
lengthy maintenance on it, you are running degraded. I always have a
4th "failure domain" node, so my cluster can self-heal (one of Ceph's
killer features) from a node failure. Your cluster should be
3 + [how-many-node-failures-you-want-to-survive-and-still-operate-sanely]
nodes.
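A quick way to sanity-check this with the standard ceph CLI ("rbd"
below is just a placeholder for whatever pool your VM disks live in):

    # replica count, and how many replicas must be up for writes
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # overall health, and how the OSDs are spread over the hosts
    ceph -s
    ceph osd tree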
Spinning OSDs with BlueStore benefit greatly from SSD DB/WALs. If your
OSDs currently have their DB/WAL on the data disk, you can gain a lot
of performance by moving the DB/WAL to an SSD or better.
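As a rough sketch of what that looks like when (re)creating an OSD with
ceph-volume on Luminous (device names here are examples only; Proxmox's
own pveceph tooling also has an option for a separate DB device):

    # /dev/sdb is the spinner, /dev/nvme0n1p1 a partition on the SSD/NVMe
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1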
Ceph gains performance with scale (number of OSD nodes). So while
Ceph's aggregate performance is awesome, an individual single thread
will not be amazing. With 3 nodes, a given set of data exists on all 3
nodes, and any write hits 100% of the nodes, so by using Ceph with 3
nodes you give Ceph the worst case for performance. E.g. with 4 nodes a
write would hit 75% of the cluster, with 6 nodes it would hit 50%. You
see where this is going...
But a single write will only hit one disk in each of the 3 nodes, and
will not perform better than the disks it hits. You can cheat out more
performance with RBD caching, and it is important for performance to
get a higher queue depth. AFAIK zfs send uses a queue depth of 1, which
is the worst possible case for Ceph. You may have some success by
buffering on one or both ends of the transfer [1]; see the sketches
below.
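For RBD caching, a minimal sketch: librbd's cache is set in the
[client] section of ceph.conf on the hypervisor, and the VM disk should
use writeback caching in Proxmox (VM ID, storage and disk names below
are examples only):

    [client]
        rbd cache = true
        rbd cache writethrough until flush = true

    # on the PVE node, set the guest disk to writeback
    qm set 100 --scsi0 cephstore:vm-100-disk-1,cache=writeback

And for buffering the transfer as described in [1] (dataset, snapshot
and host names are placeholders):

    zfs send tank/data@snap | mbuffer -s 128k -m 1G | \
      ssh my-vm 'mbuffer -s 128k -m 1G | zfs receive tank/data'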
If the VM has an RBD disk, you may (or may not) benefit from RBD fancy
striping [2], since operations can then hit more OSDs in parallel.
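If you want to experiment with that, note that striping parameters can
only be set when an image is created, roughly like this (pool and image
names are placeholders; by default the stripe unit equals the 4 MiB
object size and the stripe count is 1):

    # 100 GiB image with a 64 KiB stripe unit spread over 16 objects
    rbd create mypool/vm-disk --size 102400 --stripe-unit 65536 --stripe-count 16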
good luck
Ronny Aasen
[1]
https://everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/
[2] http://docs.ceph.com/docs/master/architecture/#data-striping