[pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels

Stoiko Ivanov s.ivanov at proxmox.com
Mon Jul 20 20:30:39 CEST 2020


Thanks for picking this up! Looking forward to no longer having to search
the web/our forum for good answers to questions that come up quite often.

a few mostly stylistic (as in more a matter of my taste) comments inline:

On Fri, 17 Jul 2020 14:12:32 +0200
Aaron Lauterer <a.lauterer at proxmox.com> wrote:

> This new section explains the performance and failure properties of
> mirror and RAIDZ VDEVs as well as the "unexpected" higher space usage by
> ZVOLs on a RAIDZ.
> 
> Signed-off-by: Aaron Lauterer <a.lauterer at proxmox.com>
> ---
> 
> This is a first draft to explain the performance characteristics of the
> different RAID levels / VDEV types, as well as their failure behavior.

> 
> Additionally it explains the situation why a VM disk (ZVOL) can end up
> using quite a bit more space than expected when placed on a pool made of
> RAIDZ VDEVs.
> 
> The motivation behind this is that, in the recent past, these things
> came up quite a bit. Thus it would be nice to have some documentation
> that we can link to and having it in the docs might help users to make
> an informed decision from the start.
> 
> I hope I did not mess up any technical details and that it is
> understandable enough.
> 
>  local-zfs.adoc | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 96 insertions(+)
> 
> diff --git a/local-zfs.adoc b/local-zfs.adoc
> index fd03e89..48f6540 100644
> --- a/local-zfs.adoc
> +++ b/local-zfs.adoc
> @@ -151,6 +151,102 @@ rpool/swap        4.25G  7.69T    64K  -
>  ----
>  
>  
> +[[sysadmin_zfs_raid_considerations]]
> +ZFS RAID Level Considerations
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There are a few factors to take into consideration when choosing the layout of
> +a ZFS pool.
> +
> +
> +[[sysadmin_zfs_raid_performance]]
> +Performance
> +^^^^^^^^^^^
> +
> +Different types of VDEVs have different performance behaviors. The two
we have a few mentions of vdev (written without caps in
system-booting.adoc) - for consistency, either write it in lowercase here
as well or change the system-booting part.

as for the content - a short explanation of what a vdev is might be
helpful, as well as mentioning that all top-level vdevs in a pool are
striped together (as in RAID0)
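maybe with a short example to make that explicit (disk names are just
placeholders, not from the patch):

----
# zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
----

this creates two top-level mirror vdevs, and ZFS stripes writes across
them - which is also why the 4-disk example below behaves like RAID10.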

> +parameters of interest are the IOPS (Input/Output Operations per Second) and the
> +bandwidth with which data can be written or read.
> +
> +A 'mirror' VDEV will approximately behave like a single disk in regards to both
> +parameters when writing data. When reading data it will behave like the number
> +of disks in the mirror.
in the section above (and in the installer and disk management GUI) we talk
about RAIDX - maybe refer to this, at least in parentheses:
A 'mirror' VDEV (RAID1) ....

> +
> +A common situation is to have 4 disks. When setting it up as 2 mirror VDEVs the
here the same with RAID10
> +pool will have the write characteristics of two single disks in regards to IOPS
> +and bandwidth. For read operations it will resemble 4 single disks.
> +
> +A 'RAIDZ' of any redundancy level will approximately behave like a single disk
> +in regards to IOPS with a lot of bandwidth. How much bandwidth depends on the
> +size of the RAIDZ VDEV and the redundancy level.
> +
> +For running VMs, IOPS is the more important metric in most situations.
> +
> +
> +[[sysadmin_zfs_raid_size_space_usage_redundancy]]
> +Size, Space usage and Redundancy
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +While a pool made of 'mirror' VDEVs will have the best performance
> +characteristics, the usable space will be 50% of the disks available. Less if a
> +mirror VDEV consists of more than 2 disks. To stay functional it needs to have
maybe add "(e.g. a 3-way mirror)" after "2 disks"

s/To stay functional it needs to have at least one disk per mirror VDEV
available/At least one healthy disk per mirror is needed for the pool to
work/ ? 
> +at least one disk per mirror VDEV available. The pool will fail once all disks
> +in a mirror VDEV fail.
maybe drop the last sentence

> +
> +When using a pool made of 'RAIDZ' VDEVs the usable space to disk ratio will be
> +better in most situations than using mirror VDEVs. This is especially true when
Why not actively describe the usable space: The usable space of a 'RAIDZ'
type VDEV of N disks is roughly N-X, with X being the RAIDZ-level. The
RAIDZ-level also indicates how many arbitrary disks can fail without
losing data (and drop the redundancy sentence below).

> +using a large number of disks. A special case is a 4 disk pool with RAIDZ2. In
> +this situation it is usually better to use 2 mirror VDEVs for the better
> +performance as the usable space will be the same. In a RAIDZ VDEV, any drive
> +can fail and it will stay operational. The number of sustainable drive failures
> +is defined by the redundancy level, a RAIDZ1 can survive the loss of 1 disk,
> +consequently, a RAIDZ2 the loss of 2 and a RAIDZ3 the loss of 3 disks.
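a small creation example could make the N-X rule above tangible (pool and
disk names are made up):

----
# zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
----

such a 6-disk RAIDZ2 yields roughly 4 disks of usable space, and any 2 of
the 6 disks can fail without losing data.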
> +
> +Another important factor when using any RAIDZ level is how ZVOL datasets, which
> +are used for VM disks, behave. For each data block the pool needs parity data
> +which are at least the size of the minimum block size defined by the `ashift`
which _is_ at least the size of?

> +value of the pool. With an ashift of 12 the block size of the pool is 4k.  The
> +default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
> +written will cause two additional 4k parity blocks to be written,
> +8k + 4k + 4k = 16k.  This is of course a simplified approach and the real
> +situation will be slightly different with metadata, compression and such not
> +being accounted for in this example.
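following the same simplified math, it might be worth showing how a larger
`volblocksize` improves the ratio: a 16k block on the same RAIDZ2 needs
16k + 4k + 4k = 24k, i.e. 1.5x instead of 2x raw usage (again ignoring
padding, metadata and compression).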
> +
> +This behavior can be observed when checking the following properties of the
> +ZVOL:
> +
> + * `volsize`
> + * `refreservation` (if the pool is not thin provisioned)
> + * `used` (if the pool is thin provisioned and without snapshots present)
> +
> +----
> +# zfs get volsize,refreservation,used /<pool>/vm-<vmid>-disk-X
> +----
the '/' at the beginning should be dropped
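maybe also add a concrete invocation, e.g. (assuming the default
'local-zfs' storage on 'rpool/data', the vmid is made up):

----
# zfs get volsize,refreservation,used rpool/data/vm-100-disk-0
----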
> +
> +`volsize` is the size of the disk as it is presented to the VM, while
> +`refreservation` shows the reserved space on the pool which includes the
> +expected space needed for the parity data. If the pool is thin provisioned, the
> +`refreservation` will be set to 0. Another way to observe the behavior is to
> +compare the used disk space within the VM and the `used` property. Be aware
> +that snapshots will skew the value.
> +
> +To counter this effect there are a few options.
s/this effect/the increased use of space/
> +
> +* Increase the `volblocksize` to improve the data to parity ratio
> +* Use 'mirror' VDEVs instead of 'RAIDZ'
> +* Use `ashift=9` (block size of 512 bytes)
> +
> +The `volblocksize` property can only be set when creating a ZVOL. The default
> +value can be changed in the storage configuration. When doing this, the guest
> +needs to be tuned accordingly and depending on the use case, the problem of
> +write amplification is just moved from the ZFS layer up to the guest.
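could be worth pointing to where that default lives - the zfspool entry in
/etc/pve/storage.cfg, something like (values are only an illustration):

----
zfspool: local-zfs
	pool rpool/data
	content images,rootdir
	blocksize 16k
----

newly created ZVOLs then get `volblocksize=16k`, while existing ones keep
their old value.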
> +
> +A pool made of 'mirror' VDEVs has a different usable space and failure behavior
> +than a 'RAIDZ' pool.
This is already explained above?

Maybe add a short recommendation - 'mirrored (RAID10) setups have favorable
behavior for VM workloads - use them, unless your environment has specific
needs where the RAIDZ performance characteristics are acceptable'?


> +
> +Using `ashift=9` when creating the pool can lead to bad
> +performance, depending on the disks underneath, and cannot be changed later on.
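maybe make explicit how `ashift` is set at pool creation time, e.g.
(placeholder disk names again):

----
# zpool create -o ashift=9 tank mirror /dev/sda /dev/sdb
----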
> +
> +
>  Bootloader
>  ~~~~~~~~~~
>  
