Update to my ZFS backup strategy

In my ZFS backup strategy blog post I briefly outlined my NAS setup, but here's a quick recap: I have a Raspberry Pi 4 4GB with a 1TB SATA HDD over USB running under the TV in my living room, and a second USB HDD for mirroring. I've been running this setup for around 18 months now, and unfortunately it doesn't quite fit my needs.

In the previous post, I focused too much on remote/cloud backups for ZFS, so I took it for granted that mirroring the disks would be trivial. While ZFS does support mirroring out of the box, I now understand that it's intended for always-online disks, so I couldn't rely on that feature without ZFS constantly nagging that the zpool was unhealthy and resilvering the disk every time I plugged it in. To get around that, I decided to keep the zpool with a single disk and zfs send the data to the second disk whenever I felt like it, mostly because once the disks were fully synced, the delta between the second disk and what's stored in the cloud would be quite small (<100MB).

That was a rookie mistake. Not only did it mean I wouldn't have 3 copies of the data anymore (at least not of the changes since the last full sync), the data sync between those two disks was a complete disaster. I didn't do proper research and did my initial sync only after I had already committed to this strategy and moved all my data in. This meant I failed to test a pretty trivial thing: can the Raspberry Pi power two USB HDDs at once? It can't. It will boot fine with one disk, but it won't boot with two, and plugging in the second disk after the Pi had fully booted and was idling still wasn't enough to spin the second disk up.

With that limitation, I told myself it would be fine if I plugged the second disk into my Lenovo notebook and did the sync over the network. My home network is gigabit-capable, and the syncs wouldn't be that big after the initial full sync. It also meant I'd have ZFS on Linux already set up in my Fedora installation, in case the main disk went bust and I needed quick access to the files. I wrote a script around ssh nas zfs send | zfs recv and, oh my, was it slow. The Raspberry Pi doesn't have any cryptographic hardware acceleration, so the transfer speed was capped at ~16MB/s, which meant several hours for the initial sync.
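The script was little more than that pipe. A minimal sketch of the idea, assuming an SSH alias nas for the Pi and placeholder pool/dataset names (tank/data on the NAS, backup/data on the notebook):

```shell
#!/usr/bin/env bash
# Sketch only: snapshot on the NAS, then stream the snapshot over SSH
# into the local pool. "nas", tank/data and backup/data are placeholders.
set -euo pipefail

sync_over_ssh() {
  local snap
  snap="tank/data@$(date +%F)"
  # Take the snapshot remotely and pipe the full stream into the local pool.
  ssh nas "zfs snapshot $snap && zfs send $snap" | zfs recv -F backup/data
}
```

Every byte goes through SSH's cipher here, which is exactly why the Pi's lack of crypto acceleration became the bottleneck.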

I tried to help the Raspberry Pi a bit by doing an unencrypted transfer with zfs send | nc and nc -l over a direct cable connection between the Pi and the notebook, and I got close to 85MB/s in that scenario, which was a lot more palatable. However, it also meant a tripping hazard (a.k.a. a cable) spanning from my living room to my office for a few hours, so I began to regret my decisions. But at least the disks were now synced, and subsequent syncs were manageable over SSH, as the deltas were small.
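The netcat variant looked roughly like this. A sketch, with placeholder addresses (192.168.0.2 for the notebook on the direct link), an arbitrary port, and placeholder dataset names; wrapped in functions only to make the two sides explicit:

```shell
#!/usr/bin/env bash
# Sketch only: ZFS stream over plain TCP, no encryption. Port, address
# and dataset names are placeholders for illustration.
set -euo pipefail

# Run on the notebook first: listen and pipe the stream into the local pool.
receive_side() {
  nc -l 9000 | zfs recv -F backup/data
}

# Then on the Pi: send the snapshot to the notebook in the clear.
send_side() {
  zfs send tank/data@initial | nc 192.168.0.2 9000
}
```

Skipping SSH removes the cipher overhead entirely, which is why the Pi could get close to saturating the disk instead of its CPU.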

Enter kernel updates.

Fast forward 3 months: I'm back from a holiday abroad on which I took my NAS server (but not the second disk). I kept the NAS server up to date, and ZFS on Linux mostly worked across updates (although it was a bit of a pain between ZOL 0.8.4 and 2.0.0). Since my Lenovo notebook is my main personal machine, I run Fedora on it because of how quickly packages get updated. This also meant that, once I was back from vacation, it was running the 5.16 kernel, but ZFS on Linux's support for it wouldn't come for another 2 months.

Sidenote: this post is in no way a critique of the ZFS on Linux project. Using ZFS at all is only possible because of their work, and I understand well enough the problems faced by unpaid Open Source contributors and the meaning of "AS IS" and "no guarantees" in open source licenses. I'm thankful to the many authors of OpenZFS and ZOL who made this possible, and all the faults in my strategy are solely my own.

This meant my strategy fell to pieces. My initial expectation was that I'd just plug the second disk into the NAS server every second week or so and it would automagically keep both disks mirrored. In practice, I had to constantly put effort into keeping both disks in sync: plugging the disk into a second machine and fiddling around with commands. A bit of lack of preparation even forced me to resync the disk from scratch once, because I screwed up some parameters in the zfs recv.

ZOL 2.1.3 was released last week with 5.16 kernel support, but I haven't got around to syncing the disks again, and they've now been out of sync for about 4 months. I kept the snapshots on the primary disk, so I should be able to do an incremental sync whenever I get around to it, but the situation I've put myself in doesn't get me excited to jump into that problem. It also doesn't mean this won't happen again, so I have to start from scratch and develop a new strategy that considers both incremental cloud backups and mirroring to a secondary disk that won't stay connected all the time. Maybe ZFS is not the answer for me, maybe the Raspberry Pi is not the answer for me, maybe ZOL is not the answer for me. I don't know yet.
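For reference, that catch-up should boil down to an incremental send from the last snapshot both disks have in common. A sketch, assuming placeholder pool names and that the mirror's newest snapshot still exists on the primary:

```shell
#!/usr/bin/env bash
# Sketch only: incremental catch-up between a primary dataset and an
# out-of-date mirror. tank/data and backup/data are placeholder names.
set -euo pipefail

latest_snapshot() {
  # Newest snapshot of a dataset, e.g. tank/data@2022-01-15
  zfs list -H -t snapshot -o name -s creation "$1" | tail -n 1
}

catch_up() {
  local src="$1" dst="$2"
  local last_common newest
  last_common=$(latest_snapshot "$dst")   # mirror's most recent snapshot
  newest=$(latest_snapshot "$src")        # primary's most recent snapshot
  # Send only the delta between the two; -F rolls the mirror back first
  # if it has stray changes since that snapshot.
  zfs send -i "${last_common#*@}" "$newest" | zfs recv -F "$dst"
}
```

This only works because the old snapshots were kept on the primary; had I pruned them, the only option left would be another full send.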

The good part of this experiment is that I got to learn what not to do. So, for my future self, a small summary of lessons from this experiment:

  • Consider the impacts of bleeding edge package updates for infrastructure services;
  • The server should be capable of executing the entire backup strategy on its own;
  • Consider the power requirements of your system;
  • Servers should have cryptographic hardware acceleration;
  • Don’t take trivial parts of the backup process for granted; validate the entire model before committing to it;
  • Manual steps in a repetitive process must take into consideration your willpower to deal with it;
  • Under no circumstances assume your partner will be happy with a UTP cable crossing the living room;

If you have any suggestions about any of my downfalls in this project, I'd appreciate the feedback.

Thank you.