Taking advantage of ZFS on root
Proxmox seem to be heavily in favour of ZFS, including for the root filesystem. In fact, it is the only production-ready option in the stock installer if you want to make use of e.g. a mirror. However, the only benefit of ZFS in terms of the Proxmox VE feature set lies in the support for replication across nodes, which, for smaller clusters, is a perfectly viable alternative to shared storage. Beyond that, Proxmox do NOT take advantage of the distinct filesystem features. For instance, if you make use of Proxmox Backup Server (PBS), there is absolutely no benefit to be had from ZFS's native snapshot support.
Note
The designations of various ZFS setups in the Proxmox installer are incorrect - there is no RAID0 or RAID1, or other such levels, in ZFS. Instead, these are single, striped or mirrored virtual devices that the pool is made up of (and they all still allow for redundancy), whereas the (correctly designated) RAIDZ levels are not directly comparable to classical parity RAID - the numbering does not mean what one might expect. This is where Proxmox prioritised ease of onboarding over the opportunity to educate its users - which is to their detriment when consulting the authoritative documentation.
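For illustration, this is roughly how the same layouts would be created manually with `zpool` - the pool name and device paths below are placeholders only:

zpool create tank /dev/sda                              # a single-disk vdev
zpool create tank /dev/sda /dev/sdb                     # two striped vdevs (installer's "RAID0")
zpool create tank mirror /dev/sda /dev/sdb              # one mirrored vdev (installer's "RAID1")
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc     # a RAIDZ1 vdev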
ZFS on root
In turn, there are seemingly few benefits of ZFS on root with a stock Proxmox VE install. If you require replication of guests, you absolutely do NOT need ZFS for the host install itself. Instead, it would be advisable to create a ZFS pool (just for the guests) after a bare install. Many would find this confusing, as non-ZFS installs set you up with LVM instead - a configuration you would then need to revert, i.e. delete the superfluous partitioning prior to creating a non-root ZFS pool.
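As a rough sketch only - assuming spare disks and placeholder names, not a definitive procedure - such a guest-only pool could be created and then handed to PVE as storage like this:

zpool create -o ashift=12 guests mirror /dev/sdb /dev/sdc
pvesm add zfspool guests --pool guests --content images,rootdir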
Further, if mirroring of the root filesystem itself is the only objective, one would get a much simpler setup with a traditional no-frills Linux/md software RAID solution, which does NOT suffer from the write amplification inevitable with any copy-on-write filesystem.
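Purely for comparison, a minimal mdadm mirror - with hypothetical partitions standing in for the real ones - would be assembled along these lines:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
mkfs.ext4 /dev/md0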
No support
No built-in backup features of Proxmox take advantage of the fact that ZFS on root specifically allows convenient snapshotting, serialisation and sending the data away very efficiently - in terms of both space utilisation and performance - using facilities already provided by the very filesystem the operating system runs off.
Finally, since ZFS is not reliably supported by common bootloaders - in terms of keeping up with upgraded pools and their new features over time, certainly not with the bespoke versions of ZFS as shipped by Proxmox - further non-intuitive measures need to be taken. It is necessary to keep “synchronising” the initramfs and available kernels from the regular `/boot` directory (which might be inaccessible for the bootloader when residing on an unusual filesystem such as ZFS) to the EFI System Partition (ESP), which was not originally meant to hold full images of about-to-be-booted systems. This requires the use of non-standard bespoke tools, such as `proxmox-boot-tool`.
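For reference, the current state of this synchronisation on a stock install can be inspected with the tool itself:

proxmox-boot-tool status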
So what are the actual out-of-the-box benefits of ZFS on root with a Proxmox VE install? None whatsoever.
A better way
This might be an opportunity to take a step back and migrate your install away from ZFS on root or - as we will have a closer look at here - actually take real advantage of it. The good news is that it is NOT at all complicated, it only requires a different bootloader solution that happens to come with lots of bells and whistles. That, and some understanding of ZFS concepts - but then again, using ZFS only makes sense if we want to put such understanding to good use, as Proxmox do not do this for us.
ZFS-friendly bootloader
A staple of any sensible ZFS-on-root install, at least on a UEFI system, is the conspicuously named bootloader ZFSBootMenu (ZBM) - a solution that is an easy add-on for an existing system such as Proxmox VE. It will not only allow us to boot with our root filesystem directly off the actual `/boot` location within it - so no more intimate knowledge of Proxmox bootloading needed - but also let us have multiple root filesystems to choose from at any given time. Moreover, it will also be possible to create e.g. a snapshot of a cold system before it boots up, similarly to what we once did in a somewhat more manual (and seemingly tedious) process with the Proxmox installer - but with just a couple of keystrokes and native to ZFS.
There’s a separate guide on installation and use of ZFSBootMenu with Proxmox VE, but it is worth learning more about the filesystem before proceeding with it.
ZFS does things differently
While introducing ZFS is well beyond the scope here, it is important to summarise the basics in terms of differences to a “regular” setup.
ZFS is not a mere filesystem; it doubles as a volume manager (such as LVM), and if it were not for UEFI requiring a separate EFI System Partition with a FAT filesystem - which ordinarily has to share the same (or sole) disk in the system - it would be possible to present the entire physical device to ZFS and skip regular disk partitioning altogether.
In fact, the OpenZFS docs boast that a ZFS pool is “full storage stack capable of replacing RAID, partitioning, volume management, fstab/exports files and traditional single-disk file systems.” This is because a pool can indeed be made up of multiple so-called virtual devices (vdevs). This is largely a matter of conceptual approach, as the most basic vdev is nothing more than what would otherwise be considered a block device: e.g. a disk, a traditional partition of a disk, or even just a file.
Important
It is often overlooked that vdevs, when combined (e.g. into a mirror), constitute a vdev themselves, which is why it is possible to create e.g. striped mirrors without much thinking about it.
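A striped mirror is then simply a pool made up of two mirror vdevs - again with a placeholder pool name and devices:

zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd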
Vdevs are organised in a tree-like structure and therefore the top-most vdev in such hierarchy is considered a root vdev. The simpler and more commonly used reference to the entirety of this structure is a pool, however.
We are not particularly interested in the substructure of the pool here - after all, a typical PVE install with a single-vdev pool (but also any other setup) results in a single pool named `rpool` getting created, which can simply be seen as a single entry:
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 126G 1.82G 124G - - 0% 1% 1.00x ONLINE -
But a pool is not a filesystem in the traditional sense, even though it can appear as such. Without any special options specified, creating a pool - such as `rpool` - indeed results in a filesystem getting mounted under the `/rpool` location, which can be checked as well:
findmnt /rpool
TARGET SOURCE FSTYPE OPTIONS
/rpool rpool zfs rw,relatime,xattr,noacl,casesensitive
But this pool as a whole is not really our root filesystem per se, i.e. `rpool` is not what is mounted to `/` upon system start. If we explore further, there is a structure to the `/rpool` mountpoint:
apt install -y tree
tree /rpool
/rpool
├── data
└── ROOT
    └── pve-1

4 directories, 0 files
These are called datasets in ZFS parlance (and they are indeed equivalent to regular filesystems, except for special types such as zvols) and would ordinarily be mounted into their respective (or intuitive) locations, but if you went on to explore these directories further on a PVE install specifically, you would find them empty.
The existence of datasets can also be confirmed with another command:
zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 1.82G 120G 104K /rpool
rpool/ROOT 1.81G 120G 96K /rpool/ROOT
rpool/ROOT/pve-1 1.81G 120G 1.81G /
rpool/data 96K 120G 96K /rpool/data
rpool/var-lib-vz 96K 120G 96K /var/lib/vz
This also gives a hint of where each of them will have its mountpoint - they do NOT have to be analogous to the dataset paths.
Important
A mountpoint as listed by `zfs list` does not necessarily mean that the filesystem is actually mounted there at the given moment.
Datasets may appear like directories, but - as in this case - they can be independently mounted (or not) anywhere into the filesystem at runtime. This is a perfect example: the root filesystem is mounted under the `/` path, but actually held by the `rpool/ROOT/pve-1` dataset.
Important
Do note that paths of datasets start with a pool name, which can be arbitrary (the `rpool` here has no special meaning to it), but they do NOT contain the leading `/` as an absolute filesystem path would.
Mounting of regular datasets happens automatically, which in the case of the PVE installer resulted in superfluous directories like `/rpool/ROOT` appearing - virtually empty. You can confirm that such an empty dataset is mounted and even unmount it without any ill effects:
findmnt /rpool/ROOT
TARGET SOURCE FSTYPE OPTIONS
/rpool/ROOT rpool/ROOT zfs rw,relatime,xattr,noacl,casesensitive
umount -v /rpool/ROOT
umount: /rpool/ROOT (rpool/ROOT) unmounted
Some default datasets of Proxmox VE are simply not mounted and/or accessed under `/rpool` - a testament to how disentangled datasets and mountpoints can be.
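Which ZFS filesystems are actually mounted at the moment (and where) can always be listed with:

zfs mount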
You can even go ahead and delete such (unmounted) subdirectories. You will, however, notice that - even if the `umount` command does not fail - the mountpoints keep reappearing.
But there is nothing in the usual mounts list as defined in `/etc/fstab` which would imply where they are coming from:
cat /etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
The issue is that mountpoints are handled differently when it comes to ZFS. Everything goes by the properties of the datasets, which can be examined:
zfs get mountpoint rpool
NAME PROPERTY VALUE SOURCE
rpool mountpoint /rpool default
This will be the case for all of them except the explicitly specified ones, such as the root dataset.
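Querying that one the same way (assuming the stock `rpool/ROOT/pve-1` layout) shows a locally set value:

zfs get mountpoint rpool/ROOT/pve-1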
NAME PROPERTY VALUE SOURCE
rpool/ROOT/pve-1 mountpoint / local
When you do NOT specify a property on a dataset, it is typically inherited by child datasets from their parent (that is what the tree structure is for), and there are fallback defaults when all of them (in the path) are left unspecified. This is generally meant to facilitate the friendly behaviour of a new dataset immediately appearing as a mounted filesystem at a predictable path - so we should not be caught by surprise by this with ZFS.
It is completely benign to stop mounting empty parent datasets when all their children have a locally specified `mountpoint` property, and we can absolutely do that right away:
zfs set mountpoint=none rpool/ROOT
Even the empty directories will NOW disappear. And this will be remembered upon reboot.
Tip
It is actually possible to specify `mountpoint=legacy`, in which case the rest can then be managed as a regular filesystem would be - with `/etc/fstab`.
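As a minimal sketch only - using `rpool/data` purely as an example dataset we do not actually need to change - legacy mounting would look like this:

zfs set mountpoint=legacy rpool/data

# with a corresponding line in /etc/fstab:
rpool/data  /rpool/data  zfs  defaults  0  0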
So far, we have not really changed any behaviour, just learned some basics of ZFS and ended up in a neater mountpoints situation:
NAME USED AVAIL REFER MOUNTPOINT
rpool 1.82G 120G 96K /rpool
rpool/ROOT 1.81G 120G 96K none
rpool/ROOT/pve-1 1.81G 120G 1.81G /
rpool/data 96K 120G 96K /rpool/data
rpool/var-lib-vz 96K 120G 96K /var/lib/vz
Forgotten reservation
It is fairly strange that PVE takes up the entire disk space by default and calls such a pool `rpool`, as it is obvious that the pool WILL have to be shared by datasets other than the one(s) holding the root filesystem(s).
That said, you can create separate pools, even with the standard installer - by giving it a smaller than actually available `hdsize` value:
![Proxmox VE Installer - Advanced Options - hdsize](/images/pveiso-hdsize_hu_c86b7feaf44fc491.webp)
The issue concerning us should not lie so much in the naming or separation of pools. But consider a situation where a non-root dataset, e.g. a guest without any quota set, fills up the entire `rpool`. We should at least do the minimum to ensure there is always ample space left for the root filesystem. We could meticulously set quotas on all the other datasets, but instead we really should make a reservation for the root one - or, more precisely, a `refreservation`:
zfs set refreservation=16G rpool/ROOT/pve-1
This will guarantee that 16G is reserved for the root dataset under all circumstances. Of course, it does not protect us from filling up the entire space with some runaway process, but that space cannot be usurped by other datasets, such as guests.
Tip
The `refreservation` reserves space for the dataset itself, i.e. the filesystem occupying it. If we were to set just `reservation` instead, we would also include e.g. all possible snapshots and clones of the dataset in the limit, which we do NOT want.
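The difference between the two properties can be double-checked on the dataset at any time:

zfs get reservation,refreservation rpool/ROOT/pve-1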
A fairly useful command to make sense of space utilisation in a ZFS pool and all its datasets is:
zfs list -ro space <poolname>
This will make a distinction between `USEDDS` (used by the dataset itself), `USEDCHILD` (only by the children datasets), `USEDSNAP` (snapshots), `USEDREFRESERV` (the buffer kept available when a `refreservation` was set) and `USED` (everything together). None of these should be confused with `AVAIL`, which is the space available to each particular dataset and to the pool itself - it includes the `USEDREFRESERV` of those datasets that had a `refreservation` set, but not of the others.
Snapshots and clones
The whole point of considering a better bootloader for ZFS specifically is to take advantage of its features without much extra tooling. It would be great if we could take a copy of a filesystem at an exact point, e.g. before a risky upgrade, and know we can revert back to it, i.e. boot from it should anything go wrong. ZFS allows for this with its snapshots, which record exactly the kind of state we need - they take no time to create, as they do not initially consume any space; a snapshot is simply a marker on the filesystem state from which point on changes are tracked - in the snapshot. As more changes accumulate, the snapshot will keep taking up more space. Once it is not needed, it is just a matter of ditching the snapshot - which drops the “tracked changes” data.
Snapshots of ZFS, however, are read-only. They are great to e.g. recover a forgotten customised - and since then accidentally overwritten - configuration file, or to permanently revert to as a whole, but not to temporarily boot from if we - at the same time - want to retain the current dataset state, as a simple rollback would have us go back in time without the ability to jump “back forward” again. For that, a snapshot needs to be turned into a clone.
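For completeness, such a permanent revert - which discards everything written since, so not what we are after here - would be a one-liner, using the snapshot we are about to create below as a hypothetical example:

zfs rollback rpool/ROOT/pve-1@snapshot1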
It is very easy to create a snapshot off an existing dataset and then check for its existence:
zfs snapshot rpool/ROOT/pve-1@snapshot1
zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/pve-1@snapshot1 300K - 1.81G -
Important
Note the naming convention using `@` as a separator - the snapshot belongs to the dataset preceding it.
We can then perform some operation, such as an upgrade, and check again to see the used space increasing:
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/pve-1@snapshot1 46.8M - 1.81G -
Clones can only be created from a snapshot. Let’s create one now as well:
zfs clone rpool/ROOT/pve-1@snapshot1 rpool/ROOT/pve-2
As clones are as capable as a regular dataset, they are listed as such:
zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 17.8G 104G 96K /rpool
rpool/ROOT 17.8G 104G 96K none
rpool/ROOT/pve-1 17.8G 120G 1.81G /
rpool/ROOT/pve-2 8K 104G 1.81G none
rpool/data 96K 104G 96K /rpool/data
rpool/var-lib-vz 96K 104G 96K /var/lib/vz
Do notice that both `pve-1` and the cloned `pve-2` refer to the same amount of data, and that the available space did not drop. Well, except that `pve-1` had our `refreservation` set, which guarantees it its very own claim on extra space, whilst that is not the case for the clone. Clones simply do not take up extra space until they start to refer to data other than the original.
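The relationship between the clone and the snapshot it was created from is also recorded and can be confirmed via the `origin` property - it should point back at `rpool/ROOT/pve-1@snapshot1`:

zfs get origin rpool/ROOT/pve-2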
Importantly, the mountpoint was inherited from the parent - the `rpool/ROOT` dataset - which we had previously set to `none`.
Tip
This is quite safe - NOT having unused clones mounted at all times - but it does not preclude us from mounting them on demand, if need be:
mount -t zfs -o zfsutil rpool/ROOT/pve-2 /mnt
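Once done inspecting it, the clone can simply be unmounted again:

umount /mnt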
Backup on a running system
There is one issue with the approach above, however. When creating a snapshot, even at a fixed point in time, there might be processes running whose state partly resides not on disk but e.g. in RAM, yet is crucial to the system's consistency, i.e. such a snapshot might capture a corrupt state, as nothing that was in-flight gets recorded. A prime candidate for such a fragile component would be a database, something that Proxmox heavily relies on with its own configuration filesystem of pmxcfs - and indeed, the proper way to snapshot a system like this while it is running is more convoluted, i.e. the database has to be given special consideration, e.g. be temporarily shut down, or the state as presented under `/etc/pve` has to be backed up by means of a safe SQLite database dump.
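As a rough, hedged sketch only - assuming the pmxcfs backing database sits at its default location of `/var/lib/pve-cluster/config.db` - such a dump could look like this:

sqlite3 /var/lib/pve-cluster/config.db ".backup '/root/config.db.backup'"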
This can, however, be easily resolved in a more streamlined way - by performing all the backup operations from a different environment, i.e. not on the running system itself. For the case of the root filesystem, we would have to boot off a different environment, such as when we created a full backup from a rescue-like boot. But that is relatively inconvenient. And not necessary - in our case - because we have a ZFS-aware bootloader with extra tools in mind.
We will ditch the potentially inconsistent clone and snapshot and redo them later on. As they depend on each other, they need to go in reverse order:
Warning
Exercise EXTREME CAUTION when issuing `zfs destroy` commands - there is NO confirmation prompt and it is easy to execute them without due care, in particular by omitting the snapshot part of the name following `@` and thus removing an entire dataset when passing the `-r` and `-f` switches, which we will NOT use here for that very reason.
It might also be a good idea to prepend these commands with a space character, which on a common regular Bash shell setup prevents them from getting recorded in history and thus accidentally re-executed. This would also be one of the reasons to avoid running everything under the `root` user all of the time.
zfs destroy rpool/ROOT/pve-2
zfs destroy rpool/ROOT/pve-1@snapshot1
Ready
It is at this point we know enough to install and start using ZFSBootMenu with Proxmox VE - as is covered in the separate guide which also takes a look at changing other necessary defaults that Proxmox VE ships with.
We do NOT need to bother removing the original bootloader, and it would continue to boot if we were to re-select it in UEFI - well, as long as it finds its target at `rpool/ROOT/pve-1`. But we could just as well go and remove it, similarly to when we installed GRUB instead of systemd-boot.
Note on backups
Finally, there are some popular tokens of “wisdom” around, such as “a snapshot is not a backup”, but they are not particularly meaningful. Let's consider what else we could do with our snapshots and clones in this context.
A backup is only as good as it is safe from the consequences of the inadvertent actions we expect. E.g. a snapshot is as safe as the system that has access to it, i.e. no less so than a `tar` archive would have been when stored in a separate location whilst still accessible from the same system. Of course, that does not mean it would be futile to send our snapshots somewhere away. It is something we can still easily do with the serialisation that ZFS provides. But that is for another time.
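Just to hint at the general shape of it - with an entirely hypothetical target host and pool name - such a transfer boils down to a send/receive pipe:

zfs send rpool/ROOT/pve-1@snapshot1 | ssh backup-host zfs receive backup-pool/pve-1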