[PVE-User] Poor CEPH performance? or normal?
Adam Thompson
athompso at athompso.net
Fri Jul 27 15:46:14 CEST 2018
On 2018-07-27 07:05, ronny+pve-user at aasen.cx wrote:
> rbd striping is a per image setting. you may need to make the rbd
> image and migrate data.
>
> On 07/26/18 12:25, Mark Adams wrote:
>> Thanks for your suggestions. Do you know if it is possible to change
>> an
>> existing rbd pool to striping? or does this have to be done on first
>> setup?
Please be aware that striping will not result in any increased
performance if you are using "safe" I/O modes, i.e. your VM waits for a
successful flush-to-disk after every sector. In that scenario, CEPH
will never give you write performance equal to a local disk: you're
limited to the bandwidth of a single remote disk [subsystem], *plus*
you pay the network round-trip latency on every operation, which, even
if measured in microseconds, still adds up.
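To put a very rough number on that (the latency figures below are
invented - plug in whatever your network and OSDs actually deliver),
the ceiling on flush-per-write throughput is just block size divided by
per-operation latency:

# Back-of-the-envelope ceiling on synchronous (flush-per-write) throughput.
# rtt_ms and commit_ms are hypothetical -- measure your own cluster.

def sync_write_ceiling_mib_s(block_kib, rtt_ms, commit_ms):
    """Best-case MiB/s when every write waits for a network round trip
    plus a commit on the remote disk before the next write is issued."""
    per_op_s = (rtt_ms + commit_ms) / 1000.0
    return (block_kib / 1024.0) / per_op_s

for block_kib in (0.5, 4, 64, 1024):   # 512 B, 4 KiB, 64 KiB, 1 MiB writes
    mib_s = sync_write_ceiling_mib_s(block_kib, rtt_ms=0.5, commit_ms=1.0)
    print(f"{block_kib:>6} KiB writes: at most ~{mib_s:,.1f} MiB/s")

Bigger blocks raise that ceiling quickly, which is point 1 below; past
a certain size the disk's own bandwidth takes over as the limit
instead.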
Based on my experience with this and other distributed storage systems,
I believe you will see large write-performance gains from the following:
1. Use the largest possible block size during writes. 512B sectors are
the worst-case scenario for any remote storage. Try to write in chunks
of *at least* 1 MByte; it's not unreasonable nowadays to write in
chunks of 64MB or larger. The rationale here is that you're spending
more time sending data and less time waiting for ACKs - the more you
can tilt that balance toward data, the better off you are. (There are
downsides to huge sector/block/chunk sizes, though - this isn't a "free
lunch" scenario. See #5.) There's a crude timing script for this after
the list.
2. Relax your write-consistency requirements. If you can tolerate the
small risk that comes with "Write Back" caching, you should see better
performance, especially during burst writes. During large sequential
writes there are not many ways to violate the laws of physics, and CEPH
amplifies your writes by (in your case) a factor of 2 due to
replication. (There's a small flush-frequency test after the list,
too.)
3. Switch your OSDs to storage devices with the best possible local
write speed. OSDs are limited by the performance of the underlying
device or virtual device (e.g. it's totally possible to run OSDs on a
hardware RAID6 controller). A quick way to compare candidate devices is
also sketched after the list.
4. Avoid CoW-on-CoW. Write amplification means you'll lose around 50%
of your IOPS and/or I/O bandwidth for each level of CoW nesting,
depending on workload (there's a quick compounding example after the
list). So don't put CEPH OSDs on, say, BTRFS or ZFS filesystems. A
worst-case scenario would be something like running a VM using ZFS on
top of CEPH, where the OSDs are located on BTRFS filesystems, which are
in turn virtual devices hosted on ZFS filesystems. Welcome to 1980's
storage performance, in that case! (I did that once without realizing
it... seriously, 5 MBps sequential writes was a good day!)
FWIW, CoW filesystems are generally awesome - just not when stacked. A
sufficiently fast external NAS running ZFS with VMs stored over NFS can
provide decent performance, *if* tuned correctly. iX Systems, for
example, spends a lot of time & effort making this work well, including
some lovely HA NAS appliances.
5. Remember the triangle. You can optimize a distributed storage system
for any TWO of: a) cost, b) resiliency/reliability/HA, or c) speed.
(This is a specific case of the traditional good/fast/cheap, pick-any-two
adage.)
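Since several of the points above lend themselves to quick tests, here
are some rough sketches. For #1, a crude way to see the chunk-size
effect from inside a VM: the path is made up (point it at whatever
filesystem sits on the RBD image), and fio will do this far more
rigorously, but it shows the shape of the curve.

import os
import time

def mib_per_sec(path, total_mib=256, chunk_mib=1):
    """Write total_mib of zeros in chunk_mib pieces, flushing after each
    chunk to mimic a guest that insists on per-write durability."""
    chunk = b"\0" * (chunk_mib * 1024 * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(total_mib // chunk_mib):
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
    return total_mib / (time.monotonic() - start)

for chunk_mib in (1, 16, 64):
    # /mnt/vm-disk is a placeholder mount point
    rate = mib_per_sec("/mnt/vm-disk/chunktest.bin", chunk_mib=chunk_mib)
    print(f"{chunk_mib:3d} MiB chunks: ~{rate:7.1f} MiB/s")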
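For #2, the same idea but varying how often the flush happens rather
than how big the writes are - roughly the favor a write-back cache does
for you behind the scenes. Same made-up path; only the relative numbers
mean anything.

import os
import time

def burst_write_mib_per_sec(path, flush_every, total_mib=128, chunk_mib=1):
    """Write 1 MiB chunks but only fsync every `flush_every` chunks."""
    chunk = b"\0" * (chunk_mib * 1024 * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for i in range(total_mib // chunk_mib):
            f.write(chunk)
            if (i + 1) % flush_every == 0:
                f.flush()
                os.fsync(f.fileno())
        f.flush()
        os.fsync(f.fileno())   # final flush so every run pays for durability once
    return total_mib / (time.monotonic() - start)

for flush_every in (1, 16, 128):   # every write, every 16th write, once at the end
    rate = burst_write_mib_per_sec("/mnt/vm-disk/flushtest.bin", flush_every)
    print(f"fsync every {flush_every:3d} writes: ~{rate:7.1f} MiB/s")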
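For #3, what matters most to these synchronous workloads is how quickly
a candidate OSD device acknowledges a small flushed write. fio with
fsync=1 is the proper tool; the toy version below uses another made-up
path, which should live on the device you're evaluating, not on a
filesystem that already hosts OSDs.

import os
import statistics
import time

def median_commit_ms(path, iterations=200, block=4096):
    """Median time for a 4 KiB write + fsync: a rough proxy for how fast
    the device under an OSD can acknowledge a journal/WAL-style commit."""
    buf = os.urandom(block)
    samples = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(iterations):
            t0 = time.monotonic()
            os.write(fd, buf)
            os.fsync(fd)
            samples.append((time.monotonic() - t0) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return statistics.median(samples)

print(f"median commit latency: {median_commit_ms('/mnt/candidate-disk/commit.bin'):.2f} ms")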
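And for #4, the reason stacked CoW hurts so badly is that the
roughly-50%-per-layer penalty compounds. Taking that figure at face
value (the real number varies wildly with workload):

# Hypothetical device doing 10,000 IOPS at the bottom of the stack,
# losing roughly half of that for every copy-on-write layer piled on top.
base_iops = 10_000
for layers in range(4):
    print(f"{layers} CoW layers: ~{base_iops * 0.5 ** layers:,.0f} effective IOPS")

Three layers of nesting and you're down to about an eighth of what the
hardware can do, which is how you end up back in the 1980's.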
I'm not sure I'm saying anything new here; I may have just summarized
the discussion, but the points remain valid.
Good luck with your performance problems.
-Adam