[PVE-User] Poor CEPH performance? or normal?
Ronny Aasen
ronny+pve-user at aasen.cx
Wed Jul 25 20:20:34 CEST 2018
On 25. juli 2018 02:19, Mark Adams wrote:
> Hi All,
>
> I have a proxmox 5.1 + ceph cluster of 3 nodes, each with 12 x WD 10TB GOLD
> drives. Network is 10Gbps on X550-T2, separate network for the ceph cluster.
>
> I have 1 VM currently running on this cluster, which is debian stretch with
> a zpool on it. I'm zfs sending in to it, but only getting around ~15MiB/s
> write speed. does this sound right? it seems very slow to me.
>
> Not only that, but when this zfs send is running - I can not do any
> parallel sends to any other zfs datasets inside of the same VM. They just
> seem to hang, then eventually say "dataset is busy".
>
> Any pointers or insights greatly appreciated!
Greetings

Alwin gave you some good advice about filesystems and VMs; I wanted to
say a little about Ceph.
With 3 nodes and the default and recommended size=3 pools, you cannot
tolerate any node failures. IOW, if you lose a node, or need to do
lengthy maintenance on it, you are running degraded. I always have a
4th "failure domain" node, so my cluster can self-heal (one of Ceph's
killer features) from a node failure. Your cluster should be
3 + [how-many-node-failures-you-want-to-survive-and-still-operate-sanely]
nodes.
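A quick way to sanity-check this with the standard ceph CLI ("rbd"
below is just a placeholder for whatever pool your VM disks live in):

    # replica count, and how many replicas must be up for writes
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # overall health, and how the OSDs are spread over the hosts
    ceph -s
    ceph osd tree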
Spinning OSDs with BlueStore benefit greatly from SSD DB/WALs. If your
OSDs currently have their DB/WAL on the data disk, you can gain a lot
of performance by moving the DB/WAL to an SSD or better.
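As a rough sketch of what that looks like when (re)creating an OSD with
ceph-volume on Luminous (device names here are examples only; Proxmox's
own pveceph tooling also has an option for a separate DB device):

    # /dev/sdb is the spinner, /dev/nvme0n1p1 a partition on the SSD/NVMe
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1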
Ceph gains performance with scale (number of OSD nodes). So while
Ceph's aggregate performance is awesome, an individual single thread
will not be amazing. With 3 nodes, a given set of data exists on all 3
nodes, and any write hits 100% of the nodes, so by using Ceph with 3
nodes you give Ceph the worst case for performance. E.g. with 4 nodes a
write would hit 75% of the cluster, with 6 nodes it would hit 50%. You
see where this is going...
But a single write will only hit one disk in each of the 3 nodes, and
will not perform better than the disks it hits. You can cheat out more
performance with RBD caching, and it is important for performance to
get a higher queue depth. AFAIK zfs send uses a queue depth of 1, which
is the worst possible case for Ceph. You may have some success by
buffering on one or both ends of the transfer [1]; see the sketches
below.
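For RBD caching, a minimal sketch: librbd's cache is set in the
[client] section of ceph.conf on the hypervisor, and the VM disk should
use writeback caching in Proxmox (VM ID, storage and disk names below
are examples only):

    [client]
        rbd cache = true
        rbd cache writethrough until flush = true

    # on the PVE node, set the guest disk to writeback
    qm set 100 --scsi0 cephstore:vm-100-disk-1,cache=writeback

And for buffering the transfer as described in [1] (dataset, snapshot
and host names are placeholders):

    zfs send tank/data@snap | mbuffer -s 128k -m 1G | \
      ssh my-vm 'mbuffer -s 128k -m 1G | zfs receive tank/data'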
If the VM has an RBD disk, you may (or may not) benefit from RBD fancy
striping [2], since operations can then hit more OSDs in parallel.
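If you want to experiment with that, note that striping parameters can
only be set when an image is created, roughly like this (pool and image
names are placeholders; by default the stripe unit equals the 4 MiB
object size and the stripe count is 1):

    # 100 GiB image with a 64 KiB stripe unit spread over 16 objects
    rbd create mypool/vm-disk --size 102400 --stripe-unit 65536 --stripe-count 16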
good luck
Ronny Aasen
[1]
https://everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/
[2] http://docs.ceph.com/docs/master/architecture/#data-striping