ZFS is touted as the uber-filesystem.
There are plenty of guides to ZFS that can be found by googling and for that reason I’m going to keep ZFS-specific talk to a minimum, focusing on ZFS in FUSE (zfs-fuse).
See http://en.wikipedia.org/wiki/ZFS for a primer on ZFS.

ZFS has a number of components, but the higher-level point that will interest most users is that ZFS combines a filesystem with a logical volume manager (the role LVM plays on Linux).

Some of the most relevant features are:

Copy-On-Write (COW) transactional model

This means that blocks of data are never overwritten in place when copied or modified. Instead, the data is written out to new, unused blocks and the block pointers and other meta-data are then updated. Should a write be interrupted (say, as the result of a power cut), the on-disk state can therefore only be one of two things: the meta-data points to the old data, or it points to the new data.

All data is checksummed

This means that all user data is checksummed, as is all meta-data. In fact, even the vdev labels are checksummed (vdev labels are 256KB structures that sit on each physical device in a zpool and contain information about that vdev/device as well as all others that are members of the same zpool)*.
This means that even in a non-redundant configuration ZFS can detect (but not repair) data corruption. If you add in mirroring then ZFS can not only detect corruption but should be able to repair it too.

The COW model means that most data corruption shouldn’t occur in the first place. However, silent data corruption is still a problem. ZFS’s advantage is that checksumming of data provides a means to detect such corruption, while mirroring provides the means to repair it.
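
As a rough, hypothetical illustration (the pool name and device paths below are placeholders, not ones used elsewhere in this post), a mirrored pool can be created from two spare devices and its error counters inspected afterwards:

zpool create tank mirror /dev/sdb /dev/sdc
zpool status tank

The status output lists READ, WRITE and CKSUM error counts per device; a non-zero CKSUM figure means corrupted blocks were detected and, because the pool is mirrored, should be repairable from the good copy.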

Because of the COW model, ZFS doesn’t have or need an equivalent of chkdsk (Windows) or fsck (Linux). Instead, pools can be scrubbed using the scrub command:

zpool scrub <poolname>

This command causes ZFS to verify every checksum, and as such it differs greatly from the aforementioned tools. A scrub is an excellent way to check for data corruption across an entire pool, but since it works through the whole pool it can take a long time to complete. Thankfully, a scrub can be cancelled at any point by issuing the following at a terminal:

zpool scrub -s <poolname>

Scrubs do not require exclusive access to zpools, nor do zpools have to be unmounted in the way that NTFS or ext3 partitions must be to run their respective filesystem checks. In that sense ZFS offers online filesystem checking. Moreover, while scrubbing is I/O intensive, a zpool can still be used during a scrub without excessive side-effects under light workloads.
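
For example, using the pool DISK1 from the listings later in this post, a scrub might be started and its progress checked like so (the timings shown are purely illustrative):

zpool scrub DISK1
zpool status DISK1

While the scrub runs, zpool status includes a line along the lines of ‘scrub: scrub in progress for 0h12m, 8.31% done, 2h15m to go’, and once it completes it reports how many errors were found and repaired.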

*For a more precise explanation see the ZFS On-Disk Specification, section 1.2.

Nested filesystems

You create filesystems within each zpool. Strictly speaking, the zpool itself is a pool of storage rather than a filesystem, although a root filesystem with the pool’s name is created for you automatically. You can create a zpool as follows:

zpool create <poolname> <device1> <device2>

and that pool should be automatically mounted to a folder of the same name.
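
As a hypothetical example (the device names are placeholders and will differ on your system), a new pool could be created from two spare drives and should then appear at /DISK3:

zpool create DISK3 /dev/sdc /dev/sdd
ls /DISK3

Note that with zfs-fuse the zfs-fuse daemon must be running before any zpool or zfs command will work.
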
We use this next command to list the zpools, filesystems (and snapshots):

zfs list

On my machine this gives the output:

NAME    USED  AVAIL  REFER  MOUNTPOINT
DISK1  1.96T   732G  1.65T  /DISK1
DISK2   773G   141G   773G  /DISK2
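
For pool-level figures rather than per-filesystem ones, there is a companion command that reports each pool’s total size, space used and health:

zpool list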

You can store data in the pool ‘as is’, and that is exactly what I do for most of my data.
However, and here’s the really clever part, you can create filesystems within the pool with the following command:
zfs create <pool>/<filesystem>
e.g.

zfs create DISK1/winbackup

Now the output of zfs list is:

NAME             USED  AVAIL  REFER  MOUNTPOINT
DISK1           1.96T   732G  1.65T  /DISK1
DISK1/winbackup  323G   732G   314G  /DISK1/winbackup
DISK2            773G   141G   773G  /DISK2

We can nest these filesystems too:
zfs create DISK1/winbackup/documents
Again, zfs list gives us:

NAME                       USED  AVAIL  REFER  MOUNTPOINT
DISK1                     1.96T   732G  1.65T  /DISK1
DISK1/winbackup            323G   732G   314G  /DISK1/winbackup
DISK1/winbackup/documents 8.79G   732G  8.79G  /DISK1/winbackup/documents
DISK2                      773G   141G   773G  /DISK2

Mounted, each filesystem appears to be a sub-folder of DISK1. So why bother creating filesystems instead of plain folders? Because some very useful ZFS properties can be set on a per-filesystem basis:
properties such as compression, checksumming, quotas, mountpoints and read-only status. There are also more interesting properties, such as independently setting the primary and secondary ZFS caches to hold data and meta-data, meta-data only, or nothing at all. Block sizes (see the ‘recordsize’ property) can also be set on a per-filesystem basis, although (as best as I can tell) changing the value only affects data written after the change. For a list of ZFS properties execute:

zfs get all <pool/filesystem>

Leave the name blank to get values for every pool and filesystem. It’s easy to change values using:
zfs set <property>=<value> <pool/filesystem>, e.g.

zfs set compression=off DISK1
zfs set checksum=off DISK1/winbackup
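
As a further (hypothetical) illustration, a quota and a metadata-only primary cache policy could be applied to the documents filesystem created earlier, and the result verified with zfs get; the values here are chosen arbitrarily:

zfs set quota=50G DISK1/winbackup/documents
zfs set primarycache=metadata DISK1/winbackup/documents
zfs get quota,primarycache DISK1/winbackup/documents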

Another reason why adding filesystems in this way can be desirable is the set of zfs snapshot, zfs send and zfs receive commands.
zfs snapshot <filesystem>@<snapshot-name>
e.g.

zfs snapshot DISK1/winbackup/documents@monday

allows you to create a read-only snapshot of a given filesystem.
At creation this snapshot consumes no additional space; the size of the snapshot grows as the filesystem from which it was created is altered. This is not to be considered a proper form of backup, since if one destroys the pool containing the snapshot then the data will be lost.
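
Snapshots show up in a listing of their own and, if need be, a filesystem can be rolled back to one; the snapshot here is the ‘monday’ one created above, and note that a rollback discards any changes made since the snapshot was taken:

zfs list -t snapshot
zfs rollback DISK1/winbackup/documents@monday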

The send and receive commands allow one to send and receive ZFS filesystems from one pool to another (even over a network). Because the filesystem is sent and received, rather than simply the data, per-filesystem settings like ACLs and the other ZFS properties mentioned earlier are preserved as well.
However, at this time (May 2010) zfs-fuse does not support send/receive.
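
For reference, on platforms that do support it (OpenSolaris at the time), replicating a snapshot to another pool, or to another machine over ssh, looks roughly like this; the target names and remote host are placeholders:

zfs send DISK1/winbackup/documents@monday | zfs receive DISK2/documents-copy
zfs send DISK1/winbackup/documents@monday | ssh backuphost zfs receive tank/documents-copy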

A downside of ZFS at this point in time is the inability to shrink pools or to remove top-level vdevs from a pool.

More to come…
