Btrfs (pronounced as "better F S", "butter F S", "b-tree F S", or "B.T.R.F.S.") is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager (distinct from Linux's LVM), developed together. It was created by Chris Mason in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel.
Btrfs is intended to address the lack of pooling, snapshots, integrity checking, data scrubbing, and integral multi-device spanning in Linux file systems. Mason, the principal Btrfs author, stated that its goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable".
In 2008, the principal developer of the ext3 and ext4 file systems, Theodore Ts'o, stated that although ext4 has improved features, it is not a major advance; it uses old technology and is a stop-gap. Ts'o said that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management". Btrfs also has "a number of the same design ideas that ReiserFS/4 had".
Btrfs 1.0, with finalized on-disk format, was originally slated for a late-2008 release, and was finally accepted into the Linux kernel mainline in 2009. Several Linux distributions began offering Btrfs as an experimental choice of root file system during installation.
In July 2011, Btrfs automatic defragmentation and scrubbing features were merged into version 3.0 of the Linux kernel mainline. Besides Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements. In June 2012, Mason left Oracle for Fusion-io, which he left a year later with Josef Bacik to join Facebook. While at both companies, Mason continued his work on Btrfs.
In 2012, two Linux distributions moved Btrfs from experimental to production or supported status: Oracle Linux in March, followed by SUSE Linux Enterprise in August.
In 2015, Btrfs was adopted as the default file system for SUSE Linux Enterprise Server (SLE) 12.
In August 2017, Red Hat announced in the release notes for Red Hat Enterprise Linux (RHEL) 7.4 that it no longer planned to move Btrfs, which had been included as a "technology preview" since the RHEL 6 beta, to a fully supported feature, noting that it would remain available in the RHEL 7 release series. Btrfs was removed from RHEL 8 in May 2019. RHEL moved from ext4 in RHEL 6 to XFS in RHEL 7.
In 2020, Btrfs was selected as the default file system for Fedora Linux 33 for desktop variants.
Btrfs provides a clone operation that atomically creates a copy-on-write copy of a file. From the command line, a file is cloned with:

cp --reflink <source file> <destination file>
By cloning, the file system does not create a new link pointing to an existing inode; instead, it creates a new inode that initially shares the same disk blocks with the original file. As a result, cloning works only within the boundaries of the same Btrfs file system, but since version 3.6 of the Linux kernel it may cross the boundaries of subvolumes under certain circumstances. The actual data blocks are not duplicated; at the same time, due to the copy-on-write (CoW) nature of Btrfs, modifications to any of the cloned files are not visible in the original file and vice versa.
Cloning should not be confused with hard links, which are directory entries that associate multiple file names with a single file. While hard links can be taken as different names for the same file, cloning in Btrfs provides independent files that initially share all their disk blocks.
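For illustration, the following commands contrast the two mechanisms (file names are placeholders; stat shows that a clone receives its own inode number while a hard link does not):

cp --reflink=always original.bin clone.bin            # new inode, initially sharing all data blocks
ln original.bin hardlink.bin                          # no new inode; a second name for the same file
stat -c '%i %n' original.bin clone.bin hardlink.bin   # the clone's inode number differs from the other two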
Support for this Btrfs feature was added in version 7.5 of the GNU Coreutils, via the --reflink option to the cp command.
In addition to data cloning (the FICLONE ioctl, which underlies cp --reflink), Btrfs also supports out-of-band deduplication via the FIDEDUPERANGE ioctl. This functionality allows two files with (even partially) identical data to share storage.
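As an illustrative sketch, a deduplication pass over an existing directory tree can be run with a userspace tool such as duperemove, which scans for duplicate extents and asks the kernel to share them (the path is a placeholder):

duperemove -dr /mnt/data    # -d: actually deduplicate matches, -r: recurse into subdirectories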
Subvolumes can be created at any place within the file system hierarchy, and they can also be nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similarly to the way a top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume is not possible until all subvolumes below it in the nesting hierarchy are deleted; as a result, top-level subvolumes cannot be deleted.
Any Btrfs file system always has a default subvolume, which is initially set to be the top-level subvolume, and is mounted by default if no subvolume selection option is passed to mount. The default subvolume can be changed as required.
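A minimal sketch of subvolume administration with the btrfs command-line tool (the mount point and the subvolume ID 257 are placeholders; the real ID is taken from the list output):

btrfs subvolume create /mnt/pool/home        # create a new subvolume
btrfs subvolume list /mnt/pool               # show subvolumes and their IDs
btrfs subvolume set-default 257 /mnt/pool    # make subvolume 257 the default at mount time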
A Btrfs snapshot is a subvolume that shares its data (and metadata) with some other subvolume, using Btrfs' copy-on-write capabilities, and modifications to a snapshot are not visible in the original subvolume. Once a writable snapshot is made, it can be treated as an alternate version of the original file system. For example, to roll back to a snapshot, a modified original subvolume needs to be unmounted and the snapshot needs to be mounted in its place. At that point, the original subvolume may also be deleted.
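A sketch of this rollback procedure, assuming a file system on /dev/sdb1 whose top level is mounted at /mnt/pool and whose root subvolume is mounted at /mnt/root (all names are placeholders):

btrfs subvolume snapshot /mnt/pool/root /mnt/pool/root-snap   # create a writable snapshot
umount /mnt/root                                              # later: unmount the modified original
mount -o subvol=root-snap /dev/sdb1 /mnt/root                 # mount the snapshot in its place
btrfs subvolume delete /mnt/pool/root                         # optionally discard the original subvolume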
The copy-on-write (CoW) nature of Btrfs means that snapshots are quickly created, while initially consuming very little disk space. Since a snapshot is a subvolume, creating nested snapshots is also possible. Taking snapshots of a subvolume is not a recursive process; thus, if a snapshot of a subvolume is created, every subvolume or snapshot that the subvolume already contains is mapped to an empty directory of the same name inside the snapshot.
Taking snapshots of a directory is not possible, as only subvolumes can have snapshots. However, there is a workaround that involves reflinks spread across subvolumes: a new subvolume is created, containing cross-subvolume reflinks to the content of the targeted directory. Having that available, a snapshot of this new subvolume can be created.
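A sketch of that workaround (paths are placeholders; the cross-subvolume reflink copy requires Linux 3.6 or later):

btrfs subvolume create /mnt/pool/dirsnap                             # new, initially empty subvolume
cp -a --reflink=always /mnt/pool/data/somedir/. /mnt/pool/dirsnap/   # reflink the directory's content into it
btrfs subvolume snapshot -r /mnt/pool/dirsnap /mnt/pool/dirsnap-ro   # snapshot the new subvolume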
A subvolume in Btrfs is quite different from a traditional Logical Volume Manager (LVM) logical volume. With LVM, a logical volume is a separate block device, while a Btrfs subvolume is not and cannot be treated or used that way. Making a dd or LVM snapshot of a Btrfs volume leads to data loss if either the original or the copy is mounted while both are visible to the same computer, because Btrfs identifies file systems by their UUID, which such block-level copies duplicate.
The send/receive feature can be used with regularly scheduled snapshots for implementing a simple form of file system replication, or for the purpose of performing incremental backups.
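A sketch of one incremental backup cycle with send/receive, assuming a previous read-only snapshot home-snap1 already exists on both the source and the backup file system (all paths are placeholders):

btrfs subvolume snapshot -r /mnt/pool/home /mnt/pool/home-snap2    # take a new read-only snapshot
btrfs send -p /mnt/pool/home-snap1 /mnt/pool/home-snap2 | \
    btrfs receive /backup/pool                                     # transfer only the differences since home-snap1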
Quota groups only apply to subvolumes and snapshots, while having quotas enforced on individual subdirectories, users, or user groups is not possible. However, workarounds are possible by using different subvolumes for all users or user groups that require a quota to be enforced.
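A sketch of enforcing a quota on a per-user subvolume (sizes and paths are placeholders):

btrfs quota enable /mnt/pool              # turn on quota tracking for the file system
btrfs qgroup limit 10G /mnt/pool/home     # cap the subvolume at 10 GiB
btrfs qgroup show /mnt/pool               # report usage per quota group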
The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks shared between the two file systems before the conversion becomes permanent. Thanks to the copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during all file modifications. Until the conversion becomes permanent, only the blocks that were marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion can be undone at any time (although doing so will erase any changes made after the conversion to Btrfs).
All converted files are available and writable in the default subvolume of the Btrfs. A sparse file holding all of the references to the original ext2/3/4 file system is created in a separate subvolume, which is mountable on its own as a read-only disk image, allowing both original and converted file systems to be accessed at the same time. Deleting this sparse file frees up the space and makes the conversion permanent.
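A sketch of the conversion workflow with the btrfs-convert tool from btrfs-progs (the device name is a placeholder; the file system must be unmounted during conversion or rollback):

btrfs-convert /dev/sdb1                   # convert the ext2/3/4 file system in place
mount /dev/sdb1 /mnt                      # converted files appear in the default subvolume
# the saved ext2/3/4 metadata lives in the ext2_saved subvolume as a read-only image
umount /mnt && btrfs-convert -r /dev/sdb1    # undo the conversion, or instead:
btrfs subvolume delete /mnt/ext2_saved       # make it permanent by freeing the saved image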
In 4.x versions of the mainline Linux kernel, the in-place ext3/4 conversion was considered untested and rarely used. However, the feature was rewritten from scratch in 2016 for btrfs-progs 4.6 and has been considered stable since then.
In-place conversion from ReiserFS was introduced in September 2017 with kernel 4.13.
The developers were also working to add keyed hashes such as HMAC (SHA-256) for checksumming.
There is another tool, named btrfs restore, that can be used to recover files from an unmountable file system without modifying the broken file system itself (i.e., non-destructively).
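For example, readable files can be copied out of a damaged file system (device and target directory are placeholders):

btrfs restore /dev/sdb1 /mnt/recovered    # copy files out without writing to /dev/sdb1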
In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time, thanks to making periodic data flushes to permanent storage, by default every 30 seconds. Thus, isolated errors will cause a maximum of 30 seconds of file-system changes to be lost at the next mount. This period can be changed by specifying a desired value (in seconds) with the commit mount option.
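For example, the flush interval can be raised from the default 30 seconds to 120 seconds at mount time (device and mount point are placeholders):

mount -o commit=120 /dev/sdb1 /mnt/pool    # commit changes to permanent storage every 120 seconds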
At Oracle, Mason began work on a snapshot-capable file system that would use copy-on-write-friendly B-trees almost exclusively: not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy on write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.
Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The trees store generic items sorted by a 136-bit key. The most significant 64 bits of the key are a unique object id. The middle eight bits are an item type field: its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining (least significant) 64 bits are used in type-specific ways. Therefore, items for the same object end up adjacent to each other in the tree, grouped by type. By choosing certain key values, objects can further put items of the same type in a particular order.
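The on-disk trees and their (object id, item type, offset) keys can be inspected with a btrfs-progs helper; a sketch, run against an unmounted device whose name is a placeholder:

btrfs inspect-internal dump-tree /dev/sdb1    # prints each item as key (OBJECTID TYPE OFFSET)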
Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.
Files with hard links in multiple directories have multiple reference items, one for each parent directory. Files with multiple hard links in the same directory pack all of the links' filenames into the same reference item. This was a design flaw that limited the number of same-directory hard links to however many could fit in a single tree block. (With the default block size of 4 KiB, an average filename length of 8 bytes, and a per-filename header of 4 bytes, this comes to fewer than 350 links.) Applications that made heavy use of multiple same-directory hard links, such as git, Gnus, MAME and BackupPC, were observed to fail at this limit. The limit was eventually removed (and as of October 2012 had been merged pending release in Linux 3.7) by introducing spillover extended reference items to hold hard-link filenames that do not otherwise fit.
The item's key offset is the starting byte of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup.
Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create bookend extents, or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets.
The file system divides its allocated space into block groups which are variable-sized allocation regions that alternate between preferring metadata extents (tree nodes) and data extents (file contents). The default ratio of data to metadata block groups is 1:2. They are intended to use concepts of the Orlov block allocator to allocate related files together and resist fragmentation by leaving free space between groups. (Ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and created as needed.) Each block group is associated with a block group item. Inode items in the file system tree include a reference to their current block group.
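The current allocation of block groups can be observed per type (the mount point is a placeholder):

btrfs filesystem df /mnt/pool       # space allocated and used per block-group type: Data, Metadata, System
btrfs filesystem usage /mnt/pool    # more detailed, per-device breakdown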
Extent items contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual extent data reference items. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair of offsets bracketing that region, then following the back-references. For relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix all downwards references to those blocks, without having to scan the entire file system. This, in turn, allows the file system to efficiently shrink, migrate, and defragment its storage online.
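This back-reference walk is also exposed to user space; for example, the files occupying a given logical address can be resolved (the address and mount point are placeholders):

btrfs inspect-internal logical-resolve 3276800 /mnt/pool    # list files whose extents cover this logical address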
The extent allocation tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree itself to change. To avoid creating a feedback loop, extent tree nodes which are still in memory but not yet committed to disk may be updated in place to reflect new copied-on-write extents.
In theory, the extent allocation tree makes a conventional free-space bitmap unnecessary because the extent allocation tree acts as a B-tree version of a BSP tree. In practice, however, an in-memory red–black tree of page-sized bitmaps is used to speed up allocations. These bitmaps are persisted to disk (starting in Linux 2.6.37, via the space_cache mount option) as special extents that are exempt from checksumming and copy-on-write.
There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. If there are more checksums than can fit, they spill into another checksum item in a new leaf. If the file system detects a checksum mismatch while reading a block, it first tries to obtain (or create) a good copy of this block from another device if internal mirroring or RAID techniques are in use.
Btrfs can initiate an online check of the entire file system by triggering a file system scrub job that is performed in the background. The scrub job scans the entire file system for integrity and automatically attempts to report and repair any bad blocks it finds along the way.
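A scrub is started and monitored from the command line (the mount point is a placeholder):

btrfs scrub start /mnt/pool     # begin a background scrub of all data and metadata
btrfs scrub status /mnt/pool    # report progress and any checksum errors found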
The chunk tree tracks the file system's devices and chunks, storing each device therein as a device item and each logical chunk as a chunk map item, which provides a forward mapping from logical to physical addresses by storing its offset in the least significant 64 bits of its key. Chunk map items can be one of several types (single, dup, raid0, raid1, raid10, raid5, or raid6), reflecting the replication and striping profile of the chunk.
Superblock mirrors are kept at fixed locations: 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its generation number is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except in SSD mode which alternates updates among mirrors to provide some wear levelling.
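The superblock copies and their generation numbers can be inspected with a btrfs-progs helper (the device name is a placeholder):

btrfs inspect-internal dump-super -a /dev/sdb1    # dump all superblock mirrors, including their generation fields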