J. David Beutel 2011-11-13
I'll make a file system, named Babyfs, with a design similar to Btrfs, but simplified and limited to the requirements of project 6.
Babyfs will have two Rodeh b-trees: one for the file system, and one for the allocation of extents of blocks on the device. The FS tree contains metadata: inodes, directories, and pointers to extents of file data. The extent tree allocates blocks on the device for holding file data and metadata (the trees).
The superblock has a pointer to the root block of each tree. To guarantee integrity in case of a loss of power or communication, all modifications to data or trees are shadowed: no blocks on the device are overwritten; instead, changes are written to free blocks, shadowing the changed paths of the trees and parts of the file. Finally, the superblock is overwritten, to point to the root blocks of the modified versions of the trees.
The balanced trees described by Rodeh are b-plus trees, where the metadata is kept in the leaf nodes, and the inner nodes contain a balanced index of keys. Rodeh makes them "top-down" b-trees, suitable for shadowing, by eliminating the leaf chaining convention, and by proactively splitting or fixing index nodes that contain too many or few keys when inserting or deleting, respectively, along the path. The key count boundary is b..3b where b >= 2, e.g., 2..6.
Each node in a tree is one block (1024 bytes), and starts with
header
(16 bytes). Index nodes can contain up to 63 key_ptr
structures (16 bytes), so the boundary is 20..60, although the
tree operations can be tested with the minimal boundary, 2..6.
The keys are kept sorted in the index, and the pointer is the
block number of the node containing that key (or the next larger
key that exists in the tree). The lower bounds doesn't apply
to a root node, which is never fixed.
The maximum depth of the index nodes is 5, based on the lower bounds (20^5 = 3,200,000), providing a key for each block (maximum 2^21 = 2,097,152). (I'm not sure if the upper bounds is sufficent for this, but if so, then the maximum depth will be 4, 60^4 = 12,960,000.) This provides enough keys for the extent tree in the degenerate case of all tree nodes and 1-block files, and more than enough for the file system tree to hold up to 10,000 files and directories.
Leaf nodes contain 1..63 item
structures (16 bytes)
stacked after the header
, paired with variable sized metadata
for each item, stacked from the end of the block inward (growing
toward the item structures). An item contains its key
and the size and offset of its metadata within the block (in bytes,
if any).
The keys consist of an objectid, type code, and offset. They are sorted in that order, so keys with the same objectid are together in the index. The metadata associated with a key, if any, depends on the type of the key.
objectid: inode number
type:
INODE - offset: 0
DIR_ENT - offset: hash of the name of this directory entry
FILE_EXTENT - offset: starting offset into the file (in blocks)
objectid: block number
type: SUPERBLOCK, EXT_IDX, EXT_LEAF, FS_IDX, FS_LEAF, FILE_DATA
offset: number of blocks in extent (1 for a tree node, or N for a file)
objectid: 0
type: INODE, offset: 0 -> inode_type: DIR, ctime: 2011-11-10
type: DIR_ENT, offset: 987654 -> name: "foo", inode: 1
objectid: 1
type: INODE -> inode_type: FILE, ctime: 2011-11-11
type: FILE_EXTENT, offset: 0 -> blocknr: 5, size: 42500
objectid: 0, type: SUPERBLOCK, offset: 1
objectid: 1, type: EXT_IDX, offset: 1
objectid: 2, type: EXT_LEAF, offset: 1
objectid: 3, type: FS_IDX, offset: 1
objectid: 4, type: FS_LEAF, offset: 1
objectid: 5, type: FILE_DATA, offset: 42
(these are out of date; see the actual code)
struct superblock {
u32 magic_number; /* 0xaaa1babf */
u32 extent_tree_blocknr;
u32 fs_tree_blocknr;
u32 total_blocks;
}
struct header {
u32 blocknr;
u8 flags;
u8 nritems;
u8 level;
}
struct key {
u32 objectid;
u8 type;
u32 offset;
}
struct key_ptr {
struct key key;
u32 blocknr;
}
struct item {
struct key key;
u16 offset;
u16 size;
}
struct inode_metadata {
u8 inode_type;
time_t ctime;
time_t mtime;
}
struct dir_ent_metadata {
u32 inode;
char[201] name; /* todo: dynamic size? */
}
struct file_extent_metadata {
u32 blocknr;
u32 size; /* in bytes */
}
When the FS tree or a file allocate blocks in the extent tree, it may trigger a cascade, if the extent tree needs to then allocate more blocks for itself. I will try to avoid a Catch-22 here by buffering all the changed blocks in memory, letting them change freely throughout the operation, and only writing them out at the end of the operation, like a little transaction. (This could get tricky.)
Diagrams here and here look like they would be useful, but they were lost in a kernel.org hacking incident and have not been recovered yet.
Babyfs will have the following similarities to Btrfs:
- uses Rodeh b-trees for file system and extent allocation,
- with Btrfs-like keys, items, leaves, and block headers,
- and shadowing updates to guarantee integrity.
However, as for simplifications and limitations with respect to Btrfs, Babyfs will not support the following:
- subvolumes, snapshots, clones, copy-on-write,
- reference counts, back-references, defragmentation, file system resizing,
- transactions, generation numbers, superblock mirrors,
- checksums on data or metadata (no checksum tree),
- journaling (no log tree), buffering (beyond one operation),
- multiple devices (no chunk or device trees), RAID,
- block sizes other than 1024 bytes,
- in-line data extents,
- SSD wear leveling,
- encryption, compression,
- seeks,
- locks, concurrent access (multi-thread/process),
- special (device) files,
- device size over 2GB,
- file size over 200MB,
- file protection (ownership, group, and access permissions),
- file access control lists and extended attributes,
- file access times (but will support create and modify times),
- fstat (so no way to see create and modify times),
- links (both hard and soft),
- current directories (and parent directory back-links),
- over 10 files open at the same time,
- machine-independent disk format (little- versus big-endian).