BTRFS: Lessons Learned through Blood and Tears
Short Introduction
I don't try to hide that I love BTRFS. Over the past 10 years, I've stored hundreds of TiB of data on BTRFS, and BTRFS has been with me through disk corruption, disk failures, etc. To put this in context, in the same time frame, I've lost multiple arrays due to mdadm bugs, I've lost EXT4 volumes due to crashes, and don't get me started on FAT (*shudders*).
I find BTRFS incredibly stable and I've yet to lose data. It's so stable, I'm starting to move all my systems completely onto BTRFS (from EXT4). It's not all roses and hard drives though: BTRFS is semantically different from what we've used before. This is my write-up of the lessons I've learned caring for BTRFS.
What is BTRFS?
BTRFS is in a class that I call: third-generation filesystems. These generations aren't a "real" designation, but it's where my mind goes when categorizing the different filesystems over time.
I'm conceptually skipping over a lot of history. Imagine there's a generation 0 somewhere.
First, we had simple first-generation filesystems (compared to modern filesystems at least), like FAT or EXT2. These were great because they standardized disk formats and introduced filesystem staples, a step in the right direction. But there were shortcomings. The most glaring was the lack of atomic write operations: a power loss could corrupt data, with no great way to roll back to the state before the interrupted write. The filesystem couldn't guarantee consistency.
Then came what I call the second-generation filesystems, such as NTFS or EXT3/EXT4 - all journaling-based filesystems. Journaling was great, since it allowed for atomic modifications in relative safety (not perfect, but much better). The journal keeps track of modifications, and if something bad happens (say, power loss), the filesystem can use the journal to reconstruct a consistent state from before the error occurred, which is awesome! There's a reason why most SQL databases tend to use some kind of journal, although typically a much more complex one (e.g. a binary log).
Now we are in what I call the third-generation, copy-on-write (COW) and bit-rot aware filesystems. Modern filesystems like ReFS (to replace NTFS on Windows), BTRFS, ZFS, etc. are all in this camp.
You might have noticed that COW and bit-rot detection tend to go hand-in-hand in a lot of implementations. It turns out storing hashes of data blocks gets really difficult if modifications to the blocks are allowed later (you don't want to re-calculate the hash of an entire file if you just append to the end, right?). The natural solution is to adopt a small amount of immutability.
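To make the hashing point concrete, here is a minimal Python sketch of per-block checksums. It is a toy model, not how BTRFS is implemented: the 4 KiB block size, the function names, and the use of zlib.crc32 (BTRFS defaults to CRC32C, a different polynomial) are all assumptions for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_checksums(data):
    """Return one CRC32 per fixed-size block of data."""
    return [
        zlib.crc32(data[offset:offset + BLOCK_SIZE])
        for offset in range(0, len(data), BLOCK_SIZE)
    ]


def find_corrupt_blocks(data, stored_checksums):
    """Return the indexes of blocks whose checksum no longer matches."""
    current = block_checksums(data)
    return [
        index
        for index, (actual, expected) in enumerate(zip(current, stored_checksums))
        if actual != expected
    ]
```

Appending a block only adds one new checksum. The existing checksums stay valid because the blocks they cover never change, which is exactly the small amount of immutability described above.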
And this is what COW does. Both the old state of a file being modified, and the new state, exist on the disk at the same time (hence, copy [the data before] the write). After the commit, the old data can be removed (or in many cases, simply marked as removed). This guarantees data safety on a crash (assuming the disk isn't faking the status of write operations; some do). In many COW systems, there is no need for a journal (although it can still be useful, e.g. for replication).
Industry-wide, there are endless examples of systems adopting COW semantics. Docker/OCI containers make heavy use of COW to reduce the resource needs of many containers sharing the same image. PostgreSQL internally uses COW semantics, significantly reducing the complexity of transaction rollback (don't you love nearly instantaneous rollback? Looking at you, only-semi-ACID-compliant-MySQL...).
And there's a reason: COW naturally makes a lot of cool features possible. For example:
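The idea can be sketched with a toy in-memory model. This is only an illustration under my own assumptions (the BlockStore and CowFile classes are hypothetical names, and real filesystems track extents in trees rather than Python lists), but it shows why old and new data coexist and why snapshots are cheap.

```python
class BlockStore:
    """Append-only pool of blocks; a written block is never modified."""

    def __init__(self):
        self._blocks = []

    def allocate(self, data):
        self._blocks.append(bytes(data))
        return len(self._blocks) - 1

    def read(self, block_id):
        return self._blocks[block_id]


class CowFile:
    """A file is an ordered tuple of block ids pointing into the store."""

    def __init__(self, store, block_ids=()):
        self.store = store
        self.block_ids = tuple(block_ids)

    def write_block(self, index, data):
        # 1. Write the new data to a fresh block; the old block is untouched.
        new_id = self.store.allocate(data)
        # 2. "Commit" by switching to a new block map that points at it.
        ids = list(self.block_ids)
        if index == len(ids):
            ids.append(new_id)
        else:
            ids[index] = new_id
        self.block_ids = tuple(ids)

    def snapshot(self):
        # A snapshot copies only the block map, never the data.
        return CowFile(self.store, self.block_ids)

    def read(self):
        return b"".join(self.store.read(i) for i in self.block_ids)
```

If the process dies between step 1 and step 2, the old block map still points at complete, unmodified data. Unchanged blocks stay shared between a file and its snapshots.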
- As I mentioned, bit-rot detection and general data crash safety.
- Offline file deduplication - multiple files can safely point to the same logical chunk of data (like a hard link, without hard link aliasing).
- Instantaneous and fully consistent snapshots or file cloning.
- Better support for zoned storage (those annoying SMR disks).
- And many more!
But not everything is perfect. Like many things, COW has problems. In software engineering, we tend to like making data structures immutable. We like this because of the safety it provides, but it can also cause performance problems. BTRFS naturally has the same theoretical problems, as well as completely new problems that filesystems traditionally didn't have before it.
6 Lessons Learned
So here are the lessons!
BTRFS Is Flexible
Lesson #1: It's okay to change your mind later.
Nearly every feature in BTRFS can be migrated to, meaning you can start off with a single drive, and then use RAID1 or RAID0 later. You can disable or enable quotas at any time, or modify COW semantics anywhere. Even stuff like compression algorithms can be changed.
And all of this online!
I can't stress how awesome this is. The last time I tried to change the layout of an mdadm array, I had to restore a 40TiB server from backups (read: I don't trust mdadm). For BTRFS, I can switch between RAID1 and RAID10 completely online, and I trust BTRFS to not eat my data.
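As a rough sketch of what such an online migration looks like, the Python below builds the two btrfs commands for going from a single drive to RAID1. The mount point and device path are placeholders, and the helper names are made up for this example. It defaults to a dry run because a balance rewrites a lot of data.

```python
import subprocess


def raid1_conversion_commands(mount_point, new_device):
    """Commands to add a second device and convert data and metadata to RAID1."""
    return [
        ["btrfs", "device", "add", new_device, mount_point],
        ["btrfs", "balance", "start", "-dconvert=raid1", "-mconvert=raid1", mount_point],
    ]


def run_all(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_all(raid1_conversion_commands("/mnt/data", "/dev/sdb")) prints the commands without touching anything. The filesystem stays mounted and usable while the balance runs.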
BTRFS's Failure Mode
Lesson #2: BTRFS might care more about data safety than you do.
When BTRFS fails, BTRFS tends to fail-safe by entering a read-only mode. At face value, this seems awesome: failing in a data-consistent way is a hallmark of a good system. It can be annoying though. For example, if a BTRFS array in RAID1 mode becomes degraded (say a hard drive stops responding), you have one shot at fixing it. The old BTRFS wiki says this:
"Even if there are no single profile chunks, raid1 volumes if they become degraded may only be mounted read-write once with the options -o degraded,rw."
Due to this, a server failure involved me copying the read-only data over to a new BTRFS volume, as I accidentally rebooted twice. Annoying is an understatement, especially when you don't have the hard drive spares... Sometimes you just want things to work, regardless of data durability.
Thankfully, I haven't run into this in a while.
Care for BTRFS Responsibly
Lesson #3: You need to care for and nurture BTRFS. Preventive maintenance is recommended if your data is critical.
In some systems, say Ceph, the system takes care of you by actively performing maintenance functions, like monitoring for bit-rot or rebalancing. BTRFS doesn't do much proactively, which can be a good thing or a bad thing. BTRFS tends to wear a lot of hats: you might see it used for interactive workstation usage, embedded IoT devices, or enterprise mass-storage - all with different needs. This bare-bones approach means BTRFS can run on basically any system, but it also relies on you to care for its needs.
That said, BTRFS has sane defaults that work in 98% of environments. BTRFS self-heals if it encounters corrupt data (assuming there's another copy), can automatically detect SSDs for SSD optimizations, and normally doesn't need to be rebalanced.
If your data is important, it is your responsibility to set up recurring scrubs at the time interval that is best for you, using the btrfs scrub command. So either a crontab or a systemd timer. Scrubs will both detect issues and automatically fix them (if possible).
I run my scrubs from a systemd unit.
Note that the IO scheduling class only works if your IO scheduler supports it - the normally default mq-deadline scheduler does not support IO priorities, but something like bfq does.
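As a rough illustration of the scrub unit mentioned above, a service and timer pair could look something like the following. This is a sketch under assumptions, not the exact unit from my systems: the unit file names, the /mnt/data mount point, the /usr/bin/btrfs path, and the monthly schedule are all placeholders to adjust.

```ini
# /etc/systemd/system/btrfs-scrub-data.service (hypothetical name)
[Unit]
Description=BTRFS scrub of /mnt/data

[Service]
Type=oneshot
# -B keeps the scrub in the foreground so systemd knows when it finishes
ExecStart=/usr/bin/btrfs scrub start -B /mnt/data
# Only honored by IO schedulers that support priorities (e.g. bfq)
IOSchedulingClass=idle
Nice=19

# /etc/systemd/system/btrfs-scrub-data.timer (hypothetical name)
[Unit]
Description=Monthly BTRFS scrub of /mnt/data

[Timer]
OnCalendar=monthly
# Run a missed scrub at the next boot if the machine was off
Persistent=true

[Install]
WantedBy=timers.target
```

With both files in place, systemctl enable --now btrfs-scrub-data.timer would activate the schedule.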
On that same subject, there is no active notification when BTRFS finds an issue. You are responsible for setting up monitoring. You can use btrfs device stats to get this information; I recommend the -c flag to return an exit code based on the results.
The output is a list of per-device error counters.
If any of these values are non-zero, you likely have a hardware problem. As always, monitor SMART if possible.
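A monitoring hook can be as small as the Python sketch below. It assumes the usual output format of one "[device].counter value" line per counter; the function names are made up for this example, and the /dev/sda line in the comment only illustrates the format.

```python
import re
import subprocess

# Matches lines such as: [/dev/sda].corruption_errs  0
STAT_LINE = re.compile(r"^\[(?P<device>.+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Parse `btrfs device stats` text into {(device, counter): value}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[(match["device"], match["counter"])] = int(match["value"])
    return stats


def nonzero_counters(stats):
    """Keep only the counters that indicate a problem."""
    return {key: value for key, value in stats.items() if value != 0}


def check_mount(mount_point):
    """Run `btrfs device stats` and return any non-zero counters."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return nonzero_counters(parse_device_stats(result.stdout))
```

An empty dict from check_mount means everything is clean; anything else is worth an alert through whatever notification channel you already use.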
Morgan, from Doing Stuff, mentioned kdave/btrfsmaintenance, a good reference for new BTRFS users on what scripts you can have in your maintenance toolbox.
COW Exacerbates Fragmentation
Lesson #4: You might need to monitor or actively prevent fragmentation, depending on your workload.
I think the most commonly known issue with BTRFS is the dreaded fragmentation problem. This is completely related to the COW semantics, so not just a problem with BTRFS (see ZFS fragmentation). To my knowledge, only ReFS seems to avoid this issue, but no one really seems to know how, given the closed source nature of the filesystem.
Ultimately, many modifications to the same large file can cause significant data fragmentation, where parts of the same file might be strewn across the physical disk. Not only is this bad for seeking operations (assuming spinning rust), but it can also increase storage waste, due to the metadata overhead and reduced data density.
This is largely not an issue for the vast majority of use-cases, and might never actually become a problem. You can kind of estimate how bad fragmentation is if you use something like filefrag. I just wouldn't use the output for anything other than a barometer. If the extent count is in the thousands, you likely have a problem, but possibly not a problem that needs to be fixed.
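For a quick barometer across many files, something like this Python sketch can help. It assumes filefrag's usual "path: N extents found" summary lines; the function names and the default threshold of 1000 are my own choices for illustration.

```python
import re
import subprocess

# Matches lines such as: /srv/vm/disk.img: 4521 extents found
EXTENT_LINE = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def parse_filefrag(text):
    """Parse filefrag summary lines into {path: extent_count}."""
    counts = {}
    for line in text.splitlines():
        match = EXTENT_LINE.match(line)
        if match:
            counts[match["path"]] = int(match["count"])
    return counts


def fragmented_files(paths, threshold=1000):
    """Run filefrag on the given paths and return those at or above threshold."""
    result = subprocess.run(
        ["filefrag", *paths], capture_output=True, text=True, check=True
    )
    counts = parse_filefrag(result.stdout)
    return {path: count for path, count in counts.items() if count >= threshold}
```

Keep in mind that compressed files report inflated extent counts, since compressed extents are small, so a high number there is not necessarily real fragmentation.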
If you need to solve fragmentation, I think there are basically two schools of thought:
First, you don't let fragmentation happen by disabling COW completely. This works great for things like VM disks, where it's a good idea to tell BTRFS it's allowed to do in-place modifications (helps performance, prevents fragmentation, but decreases data durability on a crash).
Or you let fragmentation happen, and defragment on an interval (or manually, as needed). This is perfect for those backups that eventually become read-only.
But again, you likely don't need to defragment.
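For reference, both approaches boil down to one command each. The Python sketch below only builds the argument lists, with hypothetical helper names and placeholder paths, and defaults to a dry run.

```python
import subprocess


def disable_cow_command(directory):
    """chattr +C marks a directory NOCOW; only files created afterwards inherit it."""
    return ["chattr", "+C", directory]


def defragment_command(path):
    """Defragment a file, or a directory tree thanks to -r."""
    return ["btrfs", "filesystem", "defragment", "-r", path]


def run(argv, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)
```

Two caveats worth knowing: the NOCOW attribute does not retroactively apply to existing file data, so set it on an empty directory before creating the VM disks, and defragmenting can unshare reflinked or snapshotted extents, which increases disk usage.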
BTRFS Sizes Differently
Lesson #5: df only shows part of the truth.
It's really common for Linux users to just run something like df -h on the command line to get an idea of free disk space. That only works maybe 95% of the time for BTRFS, as df only tells part of the story.
In traditional filesystems, disk use was measured by the number of bytes on the disk (ignoring sparse volumes, hard links, junctions, and other shenanigans). With COW systems, there is no longer a direct correlation between the size of all your files and actual disk usage. You can have 100TiB of files, all reflink'ed together, and only take up 2GiB of actual storage space. And since BTRFS supports compression, you can fit even more bytes onto your disk. I once compressed an 80GiB SQL Server database (running on Linux) using BTRFS compression down to 5GiB (way better than what SQL Server's native page compression was estimating it could do).
Another example: I once had a routine defragmentation eat over 800GiB of the available capacity on execution. This was unexpected, but thankfully, the defragmentation was canceled when that volume reached its accounting quota (that was also unexpected, but a pleasant surprise. Seeeee, BTRFS cares). This wasn't due to reflinks being unmerged, but because recent deletions caused unbalanced disks - which caused BTRFS to artificially lower the available capacity to maintain data safety (I might be wrong here, I didn't spend a huge amount of time debugging this one). Running a balance returned free disk space to what I expected.
So ultimately, a lot of variables go into how much capacity exists and how much is actually used. When in doubt, it's best to ask BTRFS directly.
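One way to ask is btrfs filesystem usage, which reports allocation and an estimate of free space. The Python sketch below parses the "Overall:" section of its raw-bytes output (-b). The function names are made up for this example, and it assumes the section keeps its usual "Label: value" layout.

```python
import subprocess


def parse_overall_usage(text):
    """Parse the 'Overall:' section of `btrfs filesystem usage -b` into {label: bytes}."""
    usage = {}
    in_overall = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Overall:":
            in_overall = True
            continue
        if not in_overall:
            continue
        if not stripped:
            break  # a blank line ends the Overall section
        label, _, rest = stripped.partition(":")
        fields = rest.split()
        # Skip non-byte values such as ratios (e.g. "1.00")
        if fields and fields[0].isdigit():
            usage[label] = int(fields[0])
    return usage


def overall_usage(mount_point):
    """Run `btrfs filesystem usage -b` and return the parsed Overall section."""
    result = subprocess.run(
        ["btrfs", "filesystem", "usage", "-b", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_overall_usage(result.stdout)
```

The "Free (estimated)" and "Device unallocated" entries are usually the interesting ones: the first accounts for the RAID profile, and a low second value is a hint that a balance might be due.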
It's not Complete, But That's Okay
Lesson #6: BTRFS is not feature complete, but it keeps on getting better.
What originally drew me to BTRFS was that it was in-tree, which is a blessing and a curse. On one hand, you have rock solid stability and a migration path. On the other hand, BTRFS may not be as nimble as other filesystems that are developed out-of-tree.
In-tree and out-of-tree refers to whether the code is in Linus Torvalds's repo at-the-end-of-the-figurative-day. There is an out-of-tree BTRFS repo if you want the latest, but I think most of us use the version supplied by the kernel.
The go-to example for this "development slowness" is BTRFS RAID5/6 stability. Currently it's rated as "mostly OK" (read: don't store important stuff on it yet), but historically, RAID5/6 had fundamental issues that made storing anything with RAID5/6 dangerous. Will RAID5/6 support ever be fully fixed? I don't know, although BTRFS developers have expressed a desire for a rewrite. It might say more that even with the issues with RAID5/6, BTRFS hasn't just nixed the code.
Even though BTRFS development hasn't been quick, BTRFS continues to make steady progress on new features, while retaining, in my experience, nearly perfect stability. The developers are completely honest about what is stable and what is not, and the on-disk layout appears to have stabilized. All of these things are incredibly healthy for a new generation filesystem - where data durability and trust are king.
Not only is BTRFS improving over time, but application support is also getting better. It's not uncommon these days for applications to directly support COW filesystem semantics. The .NET runtime can detect that file copying can be done using reflinks, which avoids the copy taking disk space until the copy is modified. .NET isn't alone here; many other open source projects have custom handling for COW too.
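To give a feel for what that application support involves, here is a minimal Linux-only Python sketch that tries a reflink copy and falls back to a regular copy. It is not how .NET implements it; the function name is hypothetical, and FICLONE is the ioctl request number from the Linux headers (the same mechanism cp --reflink relies on).

```python
import fcntl
import shutil

FICLONE = 0x40049409  # ioctl request number from <linux/fs.h>


def reflink_or_copy(src, dst):
    """Clone src to dst via reflink if possible, else do a byte copy.

    Returns "reflink" or "copy" to say which path was taken.
    """
    with open(src, "rb") as source, open(dst, "wb") as target:
        try:
            # Ask the filesystem to share src's extents with dst.
            fcntl.ioctl(target.fileno(), FICLONE, source.fileno())
            return "reflink"
        except OSError:
            # Not supported here (e.g. EXT4, tmpfs, or different filesystems).
            pass
    shutil.copyfile(src, dst)
    return "copy"
```

On BTRFS the reflink path returns almost instantly regardless of file size, and the copy only starts consuming space as it diverges from the original.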
What I found really cool was that Docker/ContainerD have native support for BTRFS (similar to the traditional overlay2 approach).
Summary
BTRFS is a member of the next generation of COW filesystems, and is likely to replace XFS and EXT4 as the default. So far, we've seen "official" adoption from the likes of OpenSUSE, Fedora, Mint, and Ubuntu. I also see industry trends valuing immutable distributions (Flatcar Linux) and distributions that support atomic upgrades (e.g. NixOS) benefiting greatly from the features that BTRFS provides.
BTRFS is also incredibly flexible and can run in places that heavier filesystems, like ZFS, just can't. Today, BTRFS looks like a good choice for workstation usage and archival storage. I think tomorrow is bright for BTRFS as adoption improves.
If you haven't messed with BTRFS, now's a good time to start!