[PVE-User] Some thought about zfs based configurations

Pongrácz István pongracz.istvan at gmail.com
Fri Jul 31 11:33:00 CEST 2015


As recent PVE systems offer very good ZFS integration and LXC as a replacement for OpenVZ, I have some thoughts to share with users and developers to consider.

A very short introduction of zfs to understand my thoughts:

it is like a software RAID system with LVM and copy-on-write capabilities in a single package
you can manage all your stuff with only two commands: zpool and zfs
supports different raid levels (called raidz), similar to raid0, raid1, raid10, raid5, raid6, even raid with 3 parity drives (for example 8+3 drives)
supports filesystems and block devices
its first priority is to protect your data against any kind of damage
copy-on-write -> it never overwrites existing data -> it creates a new block instead (atomic writes) -> the data on disk is always consistent (!)
it can create as many snapshots as needed, as frequently as one wants, and it is possible to send a snapshot to a remote server for replication or backup purposes. Snapshots take no time and have no overhead (unlike a full tar.gz every two hours), only growing with new/modified data.
more about zfs at zfsonlinux.org
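For readers who have never touched these two commands, a minimal sketch (the pool name "tank" and the disk names are only examples; these commands need root and real disks):

```shell
# Create a pool with two parity drives (raidz2, similar to raid6)
zpool create tank raidz2 sdb sdc sdd sde sdf sdg

# Create a filesystem with on-the-fly lz4 compression
zfs create -o compression=lz4 tank/data

# Take an instant snapshot and list all snapshots
zfs snapshot tank/data@2015-07-31
zfs list -t snapshot
```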

So, if we choose zfs for the filesystem, the fun begins :)

Part one: Storing CTs and VM disk images and backups.

In general, when we create a new VM/CT, PVE creates its directory structure and puts some files under that directory. A VM uses different image file formats, while an OpenVZ CT uses that directory directly and puts its files under it. LXC will use a single file for storing its files.

In ZFS, it is possible to create an independent filesystem for every VM and CT, instead of using one top-level filesystem and putting all VMs/CTs under it. To achieve this, it would be necessary to rewrite the storage subsystem to handle the ZFS case.

Benefits of using such a layout:

every VM and CT is completely separated from the others
every parameter can be modified independently for each of them (compression, atime, arc/l2arc cache - none, all, metadata, etc.)
it is possible to create further sub-filesystems dedicated to database servers (for example, the system uses lz4 compression and full caching, but for postgresql it caches only metadata, turns off compression, uses an 8k block size, etc.)
it is possible to create backups in no time, instead of hammering all disks and the whole system for a full night every day and getting only 1 backup per day
it is possible to send snapshots to a remote host for backup or replication over the network, via vpn/ssh, independently of each other
and whatever else I forgot
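To make the proposed layout concrete, here is a sketch (the dataset names are my own examples, not an existing PVE convention):

```shell
# One filesystem per guest instead of one shared directory
zfs create tank/vm-100
zfs create tank/ct-101

# Each guest gets its own tuning, independently of the others
zfs set compression=lz4 tank/vm-100
zfs set atime=off tank/ct-101

# A sub-filesystem dedicated to postgresql inside CT 101:
# cache only metadata, no compression, 8k blocks to match the database page size
zfs create -o primarycache=metadata -o compression=off -o recordsize=8k tank/ct-101/pgdata
```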

Part two: Virtual disks for VMs

As the underlying filesystem already provides protection, caching and on-the-fly compression, my recommendation is to use raw files or ZFS block devices (zvols) as disks. There are different use cases where a raw file can be better than a volume (block device). I use raw files.
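Both options are one command away; a sketch with example names:

```shell
# Variant A: a 32 GB ZFS block device (zvol) as the VM disk
zfs create -V 32G tank/vm-100-disk-0

# Variant B: a sparse raw file on a normal dataset
truncate -s 32G /tank/vm-100/vm-100-disk-0.raw
```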

As the VM can use any kind of filesystem, the benefits of ZFS are not as impressive as on the host itself. Ext2/3/4 and other filesystems can end up corrupted after a system crash or a snapshot, so a CoW filesystem inside the VM could be the better choice. For example, btrfs can work well here: compression, block checksums and caching are not really necessary inside the guest, but the CoW feature is handy.

Imagine the system startup after a crash: even if the hardware node starts quickly, the VM must run an fsck, and data loss or filesystem corruption can still happen. With a CoW filesystem the chance of data loss is minimized.

Of course, choosing a filesystem for the VM is up to the "end user", so she/he can use whatever she/he prefers and trusts. My recommendation: tune the filesystem inside the VM, or choose a CoW one.

(Note: as far as I can see, Docker uses btrfs, too.)

Part three: Containers with direct access to the files from the hardware node (not virtual disk)

I like the OpenVZ approach, where the whole system sits inside a directory, so the CT directly uses the filesystem of the hardware node. In this case there is no overhead and no filesystem-corruption risk, because the CT uses ZFS "directly". Of course, at the application level, for example MySQL can still end up in an inconsistent state after a crash or a snapshot (although I have never experienced this).

So, using ZFS with CTs like this is a very good combination: individual files are easily accessible from the hardware node, even from snapshots, without any magic tricks: quick and elegant. As a bonus, we get all the ZFS benefits without hacking the CT.
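The no-time backups and remote replication mentioned above boil down to a few commands (hostnames and dataset names are examples; the target pool must already exist on the remote node):

```shell
# Instant snapshot of a single CT
zfs snapshot tank/ct-101@2015-07-31-1100

# Send the full snapshot to a remote node over ssh
zfs send tank/ct-101@2015-07-31-1100 | ssh backupnode zfs receive backup/ct-101

# An hour later: send only the blocks changed since the previous snapshot
zfs snapshot tank/ct-101@2015-07-31-1200
zfs send -i @2015-07-31-1100 tank/ct-101@2015-07-31-1200 | ssh backupnode zfs receive backup/ct-101
```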

Part four: Containers with "virtual disks"

The situation is the same as with VMs: using such a disk is painful. There is no real benefit to it in a standard configuration (except in some cases, like using glusterfs). As I read on some mailing lists, this lxc + virtual disk combination can cause big storage overhead, plus the ZFS benefits are lost: data loss or corruption can happen inside the container.

As I wrote earlier, I tried Docker and, as far as I can see, it uses btrfs as its filesystem, which can solve these kinds of problems. Tuning is necessary to get the maximum out of it: block checksums and compression are not necessary.

The filesystem used inside lxc containers is not a user choice; it is chosen by the system developer, in this case the Proxmox team.

I would like to recommend considering btrfs instead of ext4 for lxc containers. I think this choice is better in any case, even when the underlying filesystem is not zfs.


I think using ZFS as the filesystem has more potential than the current storage/backup model used by pve 3.4 or 4.x.

I wrote this "article" to try to push the Proxmox team to improve the system in this direction, or at least to give them feedback: your direction is a good one :)

I have a Proxmox node which uses zfs in the way I described (except for btrfs), and that node has been up and running for more than 657 days now.

I hope the developers find my thoughts useful and that they help them make a better system :)

Thank you, and sorry for this long email :)


