Am I misreading or did he create a non-redundant pool spanning the 2 SSD drives? I don't think scrubbing will keep him from losing data when one of those drives fails or gets corrupted.
Edit: Looked again and he's getting redundancy by running multiple NASes and rsyncing between them. Still seems like a risky setup though.
In the just-released 2.2.0 you can correct blocks from a remote backup copy:
> Corrective "zfs receive" (#9372) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.
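A rough sketch of what that looks like, assuming a hypothetical pool `tank` and a send stream you saved earlier as the backup copy (names and paths are made up; check the zfs-receive man page for the exact flags):

```
# save a full send stream somewhere as the backup copy
zfs send tank/data@backup > /mnt/usb/data-backup.zstream

# later, if a scrub reports corruption in that dataset, feed the
# same stream back in with the corrective receive flag to heal it
zfs receive -c tank/data@backup < /mnt/usb/data-backup.zstream
```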
> I don't think scrubbing will keep him from losing data when one of those drives fails or gets corrupted.
It pains me to see a ZFS pool with no redundancy because instead of being "blissfully" ignorant of bit rot, you'll be alerted to its presence and then have to attempt to recover the files manually via the replica on the other NAS.
I appreciate that the author recognizes what his goals are, and that pool level redundancy is not one of them, but my goals are very different.
In my setup I combine one machine with ZFS with another running btrfs, syncing via rsync/restic. And the Apple devices use APFS. I'd rather use ZFS everywhere (or, in future, bcachefs), but unfortunately Apple went with their own next-gen CoW filesystem, and I don't think ZFS is available for Synology DSM. Though perhaps I should simply replace DSM with something more to my liking (Debian-based, Proxmox).
On the other hand, I myself use a similar approach. For home purposes I prefer having redundancy across several cheaper machines without RAID 1/10/5/6/Z over a more expensive single machine with disk redundancy. And I'd rather spend the additional money on putting ECC RAM in all machines.
For home use this is a reasonable tradeoff. Imagine my non-tech-savvy wife for some reason has to get access to the data when one of the NASes has malfunctioned because the ZFS pool encryption was badly configured. Explaining "zfs receive", let alone what ZFS, Linux and SSH are, is going to be grounds for divorce. Heck, I don't even want to read man pages about ZFS myself on weekends; I have enough such problems at work. Besides, you still want 2 physical locations.
It's going to be less optimal and less professional. That's OK; for something as important as backups, keep it boring. Simply starting up the second box is simple, stupid and has well-understood failure modes. Maybe someone like me should just buy an off-the-shelf NAS.
That seems like a pretty reasonable plan actually. If I ever do a NAS I was thinking to have one disk for storing files, and a pair of disks for backing up both the first disk and my laptop.
That way everything has 3 copies, but I'm not backing up a compressed backup that might not deduplicate well, and having 4 copies might be a little excessive anyway.
It wasn't mentioned in the blog, but you can set `copies=N` on ZFS filesystems and ZFS will keep N copies of user data. This provides redundancy that protects against bitrot or minor corruption, even if your zpool isn't mirrored. Of course, it provides no protection against complete drive failure.
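For example (the dataset name is just illustrative; note the property only affects data written after it is set):

```
# keep two copies of every data block on this dataset
zfs set copies=2 tank/important

# confirm the setting
zfs get copies tank/important
```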
Clustered filesystems are old. Red Hat at some point took over the company behind GFS; perhaps worth looking into. I've also seen people in the edge computing space go for Longhorn instead of Ceph, but I don't know anything about it apart from the fact that Ceph is available in Proxmox.
My homelab with a Turing Pi 2 is running RPi 4 compute modules and an Nvidia Jetson. Each also has a small SSD (one native, two via miniPCI to SATA, and one USB). They could be used with k8s or a clustered filesystem, but I haven't played with that as of yet.
> Ceph is never the answer. It's a nice idea, but in practice it's just not that useful.
Depends. At my last job we used it for our OpenStack block/object store and it was performant enough. When we started it was HDD+SSD, but after a while (when I left) the plan was to go all-NVMe (once BlueStore became a thing).
My understanding after reading and testing is that we're talking at least two orders of magnitude difference unless you have a lot of disks (certainly more than the 6 I tried with) and quite beefy hardware.
Haven't tried all-SSD cluster yet though, only spinning rust.
I noticed that the author is using ZFS-native encryption, which in my experience is not particularly stable. I've even managed to corrupt an entire pool with ZFS send/receives when trying to make backups. I'd strongly recommend using ZFS-on-LUKS instead if encryption is required.
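The layering itself is simple; roughly something like this, with placeholder device names:

```
# one-time: create and open the LUKS container
cryptsetup luksFormat /dev/nvme0n1p2
cryptsetup open /dev/nvme0n1p2 cryptpool

# build the pool on the decrypted mapper device
zpool create tank /dev/mapper/cryptpool
```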
I have to agree. The leadership's response to and handling of errors around encryption has been... honestly pretty disappointing to see as a ZFS fan and user.
While it's not common, the number of people running into edge cases and killing their sending and sometimes even receiving pools (better have another backup!) is frankly unacceptable. "Raw sends" seem to be especially at risk, though sending in general seems to be where the issues mostly lie. My thoughts mirror the comments here: https://discourse.practicalzfs.com/t/is-native-encryption-re...
I was quite surprised when I learned that ZFS-native encryption is so underdeveloped, and that the developers still sometimes seem to forget about it when coding new features. I had presumed it would be table stakes for enterprise storage by now. Is it just that everyone layers ZFS on GELI or LUKS, or am I wrong to presume that encryption is as widespread in enterprise deployments as I thought?
It's a shame, because the main draw of native encryption for me is being able to have a zero-trust backup target with all the advantages of zfs send over user-level backup software. But I've heard of several people running into issues doing this (though luckily no actual data loss that I've heard of, just headaches).
I think large enterprises usually do encryption on the application level and not storage level these days. A KMS or encryption service, then storage services just hold encrypted blobs.
The fundamental issue with ZFS encryption is that the primary developer that created it is no longer contributing significantly to the project. It's good code, with good tests, but it's not getting any additional love.
The utilities and tooling surrounding encryption are also weak, and there are ways you can throw away critical, invisible keydata without realizing it, with no tool to correct the issue, even if you have the missing keydata on another system.
It's interesting that no one has stepped up since to continue developing Tom's code. I wonder how long the situation can continue. If no one wants to take it over in the near future, they might have to remove the feature.
I've been using ZFS with native encryption (Ubuntu Server) but also ZFS on LUKS (Arch, Ubuntu Desktop). Zero issues (though the inability to run the latest kernel can be annoying, especially on a rolling distro). Wouldn't surprise me if the write cache has a role in this issue though.
At EuroBSDcon 2022, Allan Jude gave the presentation "Scaling ZFS for NVMe":
> Learn about how ZFS is being adapted to the ways the rules of storage are being changed by NVMe. In the past, storage was slow relative to the CPU so requests were preprocessed, sorted, and coalesced to improve performance. Modern NVMe is so low latency that we must avoid as much of this preprocessing as possible to maintain the performance this new storage paradigm has to offer.
> An overview of the work Klara has done to improve performance of multiple ZFS pools of large numbers of NVMe disks on high thread count machines.
[…]
> A walkthrough of how we improved performance from 3 GB/sec to over 7 GB/sec of writes.
Oh, I remember this guy! Michael has a nice blog where he writes interesting posts about hardware, networks and all sorts of computer stuff. Sometimes I feel a little bit jealous of all these "toys" he has, but he writes about them in a good way (without wanting to show off). If you haven't seen it, read about when he upgraded his internet to 25 Gbit/s fiber: https://michael.stapelberg.ch/posts/2022-04-23-fiber7-25gbit...
Coming back on topic, the idea of having 3 custom-made NASes is surely interesting, but apart from learning and experimenting with all this, I don't see a very big advantage from a backup/security point of view compared to commercial NASes (Synology, QNAP).
For sure we can all argue here about the selection of filesystem (ZFS, btrfs, ext4...), selection of CPU, RAM type and all the rest, but it all boils down to what each one wants (and has the money to spare). IMHO I wouldn't go with QVO SSDs and non-redundant volumes (especially spanning 2 disks), but hey, that's just me :-)
We are currently testing a number of systems with 12x 30TB NVMe drives with Debian 12 and ZFS 2.2.0. Each of our systems has 2x 128G EPYC CPUs, 1.5TB of RAM, and dual-port 100GbE NICs. These systems will be used to run KVM VMs plus general ZFS data storage. The goal is to add another 12x NVMe drives and create an additional storage pool.
I have spent an enormous amount of time over the past couple of weeks tuning ZFS to give us the best balance of reads vs. writes, but the biggest problem is trying to find the right benchmark tool to properly reflect real-world usage. We are currently using fio, but the sheer number of options (queue depth, numjobs, libaio vs io_uring) makes the tool unreliable.
For example, comparing libaio vs io_uring with the same options (numjobs, etc.) makes a HUGE difference. In some cases, io_uring gives us double (or more) the performance of libaio; however, io_uring can produce numbers that don't make any sense (e.g. 105GB/sec reads for a system that maxes out at 72GB/sec). That said, we were able to push > 70GB/sec of large-block (1M) reads from 12x NVMe drives, which seems to validate that ZFS can perform well on these servers.
OpenZFS has come a long way from the 0.8 days, and the new O_DIRECT option coming out soon should give us even better performance for the flash arrays.
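For reference, the comparison jobs we're talking about look roughly like this (paths, sizes and job counts here are placeholders, not our exact config):

```
# 1M sequential reads with io_uring
fio --name=seqread --directory=/tank/bench --rw=read --bs=1M \
    --ioengine=io_uring --iodepth=32 --numjobs=8 --size=16G \
    --time_based --runtime=60 --group_reporting

# identical job with libaio for comparison
fio --name=seqread --directory=/tank/bench --rw=read --bs=1M \
    --ioengine=libaio --iodepth=32 --numjobs=8 --size=16G \
    --time_based --runtime=60 --group_reporting
```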
If you are seeing unreasonably fast read throughput, it is likely that reads are being served from the ARC. If your workload will benefit from the ARC, you may be seeing valid numbers. If your workload will not benefit from the ARC, set primarycache=metadata on the dataset and rerun your test, potentially with a pool export/import or reboot to be sure the cache is cleared.
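Concretely, something along these lines (dataset name is an example):

```
# cache only metadata in ARC for the benchmark dataset
zfs set primarycache=metadata tank/bench

# drop anything already cached before re-running the test
zpool export tank && zpool import tank
```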
The fact that fio has a bunch of options doesn’t make the tool unreliable. Not understanding the tool or what you are testing makes you unreliable as a tester. The tool is reliable. As you learn you will become a more reliable tester with it.
After seeing some of the unrealistic numbers, I set primarycache=metadata just like you pointed out. And, you are correct, I need to learn to be a better tester...
I design similar NVMe-based ZFS solutions for specialized media+entertainment and biosciences workloads and have put massive time into the platform and tuning needs.
Also think about who will be consuming the data. I've employed an RDMA-enabled SMB stack and client tuning to help get the best I/O characteristics out of the systems.
It depends on the use case. For high-speed microscopes, I may get a request that says, "we need to support 4.2 Gigabytes/second of continuous ingest for an 18-hour imaging run." - In those situations, it's best to test with realistic data.
For general video and media workloads, it may be something like, "we have to accommodate 40 editors working over 10GbE (2 x 100GbE at the server) and minimize contention while ingesting from these other sources".
I work with iozone to establish a baseline. I also have a "frametest" utility that helps when mimicking some of the video characteristics.
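A baseline iozone run might look something like this (record size, file size and path are just examples; size the file well past RAM so the ARC doesn't skew the result):

```
# sequential write+read baseline: 1M records, 16G test file
iozone -i 0 -i 1 -r 1m -s 16g -f /tank/bench/testfile
```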
Why disable swap? And... only 8GB of RAM (am I misreading that?) with ZFS? It's been years since I've needed a bunch of storage, but I remember ZFS being much happier with gobs of RAM.
ZFS doesn't need much RAM unless you have specific needs. People have run basic ZFS storage machines on 4GB of memory total and it works fine. Even 2GB has been done before, but that's a bit too low for me to suggest. Broadly, 16GB of total system RAM is my general recommended baseline, because honestly it's simply too cheap not to have it.
ZFS needs some amount of RAM just to import the pool at all, but this only becomes a practical concern when your pool gets into the hundreds of TiB.
Deduplication, which should generally not be used, used to be awful in terms of RAM consumption; these days it isn't nearly as bad and can be surprisingly viable. Dedup can still cause unexpected performance issues unless your skillset extends to digging into system analysis and tuning.
You need some amount of RAM buffer for the write TXGs to coalesce efficiently. Generally not a concern.
Finally there’s ARC, which is where all the nice things that improve your experience happen. The more the better, but just like system ram, once you have enough for your usage profile, you stop noticing much benefit beyond that. For dumb file storage, not that much is needed. Ideally you want enough to keep all the metadata, plus whatever your actual repeated read access would be. Working on video editing and VMs would require more RAM for a more optimal experience.
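If RAM really is tight you can also just cap the ARC explicitly; a minimal sketch (the 4 GiB value is only an example):

```
# persistent: limit ARC to 4 GiB on next module load
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf

# or change it at runtime
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
```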
Like I mentioned it's been a long time since I touched this stuff. I do think I set up my (underpowered) system aggressively with ARC now that you mention it. Probably was experimenting with dedup as well. I'm sure some of those things have had improvements over the years.
I've been using XigmaNAS with two ZFS pools and only 4GB of RAM for something like 8 years, on a seemingly underpowered machine (Atom D410). Granted, it doesn't fly, but for home use it's more than enough, say for serving multiple HD movies on the home LAN at the same time, provided one doesn't load it up with other heavier services. Mine is currently only serving files through NFS and SMB plus BitTorrent; all else is turned off.
BTW, I'm currently working on a bigger one that will run on a faster mini PC and an 8-bay USB 3.1 enclosure. I'm a little wary about USB connectors in this context, and I'll likely have to secure the cables firmly to avoid wear and accidental pulls, but so far results on the bench are promising.
Because we live in the real world and no system is truly in a bubble if it's connected to a network. But mostly because seeing oom-killer messages in dmesg makes me sad :).
If something gets leaky enough then it's still going to hurt performance for a while and die.
Swap delays that but extra RAM delays it too. If you take a use case where 2GB of memory is fine, and give it 8GB, then you already solved the problem swap would solve. You can always add more but you're past the point of diminishing returns.
I did something similar this year with 12x 4TB NVMe drives. I had a 12x 2TB setup in a Flashstor 12, but the CPU and RAM limitations led me to traditional RAID there. For the second NAS I just went with standard parts but with extreme performance, a 13900K and the like. To get the necessary PCIe lanes (since not every port would bifurcate) I used 3 four-way switches from AliExpress. Choosing a motherboard with two x8 CPU-attached slots made the bandwidth not a problem; the third card does hang off the chipset's DMI 4.0 link though. The boot drive hangs directly off the CPU in the x4 slot. The 13900K actually supports ECC in workstation motherboards as well. I just expose everything via NFS and call it a day.
All in all I'm happy with it. I still use spinning rust for the backup though; no sense worrying about that. At the time it was cheaper to get 12x 4TB budget NVMe and the extra parts than to go for any reasonable count of SATA SSDs; not sure if that's still true or not. SATA definitely would have been easier, e.g. for the Flashstor 12x2TB build I had to submit a kernel patch because the cheap drives were duplicating their NSIDs.
> the CPU and RAM limitations led me to traditional RAID there
IIRC there was some talk that RAID-Z leads to write amplification compared to mirrored drives, and is thus not good for cheaper SSDs. Haven't had time to sit down and think it through; does anyone know if that's right or wrong?
> I used 3 4 way switches from aliexpress
Did you find any that were significantly cheaper than ~100 USD?
In general I've found that cheap SSDs these days have better endurance than cheap spinning disks anyway. https://www.servethehome.com/discussing-low-wd-red-pro-nas-h.... I think the drives I got were rated at 1.6 PBW; no errors or failures yet. Beyond that, my workload isn't write-heavy enough for amplification to be a concern of mine, whether or not it happens. That SSDs don't count reads or active drive time against the endurance rating was more important. For the Flashstor build I did RAID 4 instead of RAID 5 though, just because I could.
I want to say that sounds about right on price. If I didn't also want high single-core performance at the same time, a used/old Threadripper/Epyc build and bifurcation would probably make more sense. I also disconnected the tiny onboard "definitely going to be noisy as hell in 3 months" low-quality fans from the switch cards and just rested a 140mm fan blowing down across the top of the 3 cards at 30% speed. Temps of the controllers and SSDs got better and it's dead silent.
If you don't specifically want the high per-core performance of something like a 13900K, a used/old Threadripper or Epyc system and bifurcation might make more sense. It'll also enable you to get maximum per-drive bandwidth, if that's a concern for you (when you have 12 drives in some form of stripe and parity, the per-drive bandwidth ends up not being that important for most sane workloads though).
I scooped up some MP34s on sale. They have something like a 1.6 PBW endurance rating and I haven't had a single one fail or start throwing errors yet (though my workload isn't as write-heavy as many might have).
That's cool. I use some 4 tb silicon power brand drives. I didn't research them much. Prices are definitely going to fall soon on 8 tb, and I'll likely retrofit things pretty quickly with those.
> My main motivation for using ZFS instead of ext4 is that ZFS does data checksumming, whereas ext4 only checksums metadata and the journal, but not data at rest.
There's also dm-integrity [1]:
> The dm-integrity target emulates a block device that has additional per-sector tags that can be used for storing integrity information.
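Standalone use is roughly like this (device and mapper names are placeholders); it can also be stacked under LUKS/dm-crypt:

```
# set up per-sector checksums on a partition, then use the mapper device as usual
integritysetup format /dev/sdb1
integritysetup open /dev/sdb1 int0
mkfs.ext4 /dev/mapper/int0
```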
I'd be interested to see a follow-up in time, to see what the I/O wear and tear looks like.
Also
"Using gokrazy instead of Ubuntu Server would get rid of a lot of moving parts. The current blocker is that ZFS is not available on gokrazy. Unfortunately that’s not easy to change, in particular also from a licensing perspective."
I don't pretend to get how licensing works, but is OpenZFS and its licensing not an option here? I know it's been really tricky with ZFS in general and I don't think it's fully resolved, but I'm not up to date on it.
An off the shelf NAS might be better. Some of them allow installing your own operating system such as TrueNAS. These boxes come with the right form factors, CPU, RAM, number of NICs, power consumption, etc.
Solutions such as Synology provide a web interface, phone apps, and applications for syncing, photo management and backup. The units come with software for domain names, SSL certificates, monitoring, drive management, WoL, etc. There is a lot of software functionality useful for a NAS built in.
Depending on the vendor you might end up with drives that run custom firmware, which cannot be crossflashed, that might not even work at all in a non-OEM system, that might not be compatible with normal block storage use, that might not support stuff like power management etc. (and allegedly might have reset SMART logs)
You can easily buy the right hardware by searching the forums and following the advice. The FreeNAS/TrueNAS forums are full of hardware guides and threads on which 10GbE NICs or controllers to buy and when you need them flashed to IT mode.
Most of the resellers will answer questions or even flash the firmware for you. Local Craigslist guys even offered to do it, but they don't know why you're doing it; the eBay people understand.
This is really cool. I did a similar thing years ago with FreeBSD + 1TB WD Green drives on an HP MicroServer. It had an AMD CPU similar to an Intel Atom, so power consumption was pretty low. Overall I was really happy with ZFS and it worked well. I ended up using jails to have it perform various tasks, like running pf as a firewall.
I'd go with ECC RAM even if it costs a little bit more, same with an enterprise NVMe SSD. The latter is worth it in case of a power cut: it allows you to enable the write cache and to use the SSD as ZFS cache, massively increasing performance if you are using mechanical HDDs. This guy goes full SSD, but for the content he serves that isn't required at all.
Also, just have enough RAM, but don't disable swap. Put swap on an enterprise SSD and let Linux handle it; it will do so cleverly. With RAM only you cannot fall back on swap at all. A minor disadvantage is that you should encrypt your swap, but with modern hardware that shouldn't be a large penalty.
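Encrypted swap with a throwaway random key is basically one line of crypttab; roughly (the partition is a placeholder):

```
# /etc/crypttab
swap  /dev/nvme0n1p3  /dev/urandom  swap,cipher=aes-xts-plain64,size=512

# /etc/fstab
/dev/mapper/swap  none  swap  sw  0  0
```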
> This is because the iGPU doesn't support ECC RAM.
What, how? The only part of the chip that touches the ECC bytes should be the memory controller itself. The reads and writes going across the internal fabric should be exactly the same.
Also the PRO versions support ECC and seem to have the same iGPU.
That's not quite true. Power fluctuations can cause instability or crashes, either of which could lead to data corruption.
That said, it's much more likely that a faulty or low quality PSU will be the cause of that rather than the cable, and there's a lot already written out there about why a high quality PSU is important.
RAM prices went down quite a bit over the last year or so and a 32GB stick of 3200 Kingston RAM with ECC here is now 70-80€, not that much more expensive than non-ECC (about 60€ for 3200, more for faster modules).
I replaced 2 of the 16GB ECC sticks in my main PC with 32GB ones a while ago and put them in my NAS, will probably do the same with the remaining 2 sticks to max out my RAM.
I think most Zen CPUs support it, but among the APUs only the PRO ones do.
I just upgraded my NAS a while ago after upgrading my main PC and replaced the 2200G with my old 1800X and some ECC RAM, but I also had to install an old GPU because that one doesn't have an iGPU.
1. A Raspberry Pi was picked as a single point of failure. This will be a headache within 24 months.
2. Daily power/thermal cycles might age some things more than the steady thermal load of always being on. Not a big deal, and optimizing power consumption is totally reasonable, but it's a trade-off to track.
> When hardware breaks, I can get replacements from the local PC store the same day.
That wouldn't be my M.O. I don't want to go to a local PC store. The ones remaining are all either expensive, sell shit, or both. I'd order from Amazon or whoever is cheapest according to a local price comparison website, and receive a replacement the next day. It also costs less time than going out shopping (heck, I even order groceries online, saving time). Besides, going through warranty would take longer than a same-day replacement anyway.
Now, if the guy were solely running mission-critical data I'd say 'OK', but nope:
> For over 10 years now, I run two self-built NAS (Network Storage) devices which serve media (currently via Jellyfin) and run daily backups of all my PCs and servers.
As such, it seems a silly requirement for the average content someone serves with Jellyfin.
I use 4x 4 TB HDDs in ZFS RAIDZ2 (running Ubuntu under Proxmox) with one enterprise-grade SSD for the OS and ZFS cache. All that Jellyfin data doesn't have to reside on any SSD. Not in this setup, and not in most setups for home users either. You also don't require HA for such data, to the point where even RAID is kind of silly.
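For reference, attaching the SSD as ZFS cache is a one-liner per vdev; something like this with placeholder device names:

```
# SSD partitions as ZFS cache devices (hypothetical names)
zpool add tank cache /dev/disk/by-id/nvme-example-part3   # L2ARC read cache
zpool add tank log   /dev/disk/by-id/nvme-example-part4   # SLOG for sync writes
zpool status tank
```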
I use restic for daily backups to a NAS in the same city, a Synology with RAID1 btrfs. But I don't back up any media served via Jellyfin. Seems silly, and I don't have fiber upload speed as of yet. The internet has backups of that. Usenet servers, for example.
None of these servers has a UPS, but in case of a power cut the enterprise SSD (a Samsung NVMe costing more than 2x as much as a consumer version with the same amount of storage) ensures the data stays consistent. Both systems use FDE.
I also have an offsite, offline backup of our most important data. It is encrypted at rest, but I don't regularly refresh it, which is kind of stupid. But that is on me.
Either way, this fellow wants some kind of silent setup (for reasons), and then SSDs make sense. But if you want value in GB per EUR or the like, even with the recent extremely low SSD prices and even with the higher energy prices, HDDs are still worth it. Especially for data like content served via Jellyfin.
I'm pondering upgrading to more than 1 Gbit for the LAN, but for now I don't see it being worth it. Also, given my server (MicroServer Gen10 Plus) only has two PCIe slots, one used for PCIe-to-NVMe and the other for iLO, I don't have the bandwidth available; a disadvantage of my current setup. I suppose I could use a USB to 2.5 Gbit adapter if USB could saturate it, but then I'd be using USB for high data transfers. It appears to me that because my switch is managed, link bonding would be a better approach, which the Synology also supports.
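If I go the bonding route, an ad-hoc LACP bond with iproute2 would look roughly like this (interface names are placeholders, the switch needs a matching LACP group, and a single TCP flow still tops out at one link's speed):

```
# create an 802.3ad bond and enslave two NICs (not persistent across reboots)
ip link add bond0 type bond mode 802.3ad
ip link set enp1s0 down && ip link set enp1s0 master bond0
ip link set enp2s0 down && ip link set enp2s0 master bond0
ip link set bond0 up
dhclient bond0   # or configure a static address
```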
> with one enterprise grade SSD for OS and ZFS cache. All that Jellyfin data doesn't have to reside on any SSD
Lately I've been experimenting with just putting the OS on spinning rust too. It's just so ingrained in us to put it on flash because of the order-of-magnitude gains in boot time, but for servers, is boot time really a factor (damn Supermicro motherboards take 1 minute+ to POST...)? Once your services are loaded into RAM, how much does the OS actually hit the drives? If your OS supports ZFS-on-root, then you have one less point of failure.
Nah, boot time is irrelevant. If boot time did matter, one would use other measures, like failover, redundancy, or k8s (though I don't have experience with that as of yet).
I've had various consumer-grade (but branded) USB sticks running Raspberry Pi OS, EdgeOS and Proxmox fail on me, to the point where I now use USB-to-NVMe or USB-to-SATA adapters instead. Why not industrial flash? I've tried industrial-grade aMLC/MLC/SLC flash and the performance is worse while the cost is high (for example, for Proxmox or EdgeOS you need at least 4 GB of storage). Those have never failed me, but SATA and NVMe drives have a high MTBF too. On Raspberry Pis I've opted for log2ram on consumer-grade (but branded) microSD cards, with great effect. Another option is to not log at all, or to use e.g. rsyslog.
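Along the same lines, you can keep the systemd journal in RAM only, which cuts most of the steady write load on the card; a sketch:

```
# keep the journal in RAM only and cap its size
mkdir -p /etc/systemd/journald.conf.d
cat <<'EOF' > /etc/systemd/journald.conf.d/volatile.conf
[Journal]
Storage=volatile
RuntimeMaxUse=64M
EOF
systemctl restart systemd-journald
```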