/ostfs

Toy filesystem with support for zero-cost snapshots and clones

Primary LanguageRustMIT LicenseMIT

OstFS 🧀

OstFS is a toy FUSE filesystem with support for zero-cost¹ snapshots and clones (think ZFS / Btrfs):

# (example requires `rustup`, since OstFS is a Rust application)

# Create a file for storing the filesystem:
$ ./ost create tank.ofs

# Mount it:
$ mkdir mnt
$ ./ost mount tank.ofs mnt &

# (`&` causes the process to run in background - press
#  the enter key to bring back your shell prompt)

# Create a file *inside* the filesystem:
$ echo 'Hello, World!' > mnt/hello.txt

# Make a snapshot (aka a read-only clone) called `init`:
$ ./ost clone create tank.ofs init --read-only

# </>
# What's cool is that creating a clone doesn't actually
# copy the underlying data - rather, OstFS tracks which
# files have changed *since* the clone and stores only
# the diff (with a pinch of salt).
#
# That is to say, creating a clone is aaalmost¹ a no-op.
# </>

# Modify the file:
$ echo 'nobody expects the spanish inquisition' > mnt/hello.txt

# ---
# Alright, now we can utilize the snapshot to time-travel!
# ---

# Umount filesystem:
$ umount mnt

# Invoke `./ofs mount` again, but instructing it to mount
# the snapshot instead of the top filesystem:
$ ./ost mount tank.ofs mnt --clone init &

# Ha, ha!
$ cat mnt/hello.txt
Hello, World!

# Note that mounting a clone doesn't rollback the changes -
# if you remounted without `--clone init`, `cat mnt/hello.txt`
# would say `nobody expects ...` again.

¹ terms and conditions apply, see below

Usage

See the command line:

$ ./ost

... but it mostly boils down to ./ost create & ./ost mount.

Architecture

OstFS stores everything in a tree structure which starts from the header and then descends into clones, the root directory and its children.

Something that ls displays as:

drwxrwxrwx   1 PWY  staff    0 Jan  1  1970 .
drwxr-xr-x  15 PWY  staff  480 Mar  8 20:49 ..
-rw-r--r--   1 PWY  staff   12 Jan  1  1970 one.txt
-rw-r--r--   1 PWY  staff   12 Jan  1  1970 two.txt

... would internally be stored as a couple of objects forming a tree:

       / ------ \
       | header |
       \ ------ /
            |
            |
        has child
            |
            |
            v
        / ----- \                / ---------- \
        | entry | -- has name -> | payload(/) |  # root note starts at `/`
        \ ----- /                \ ---------- /  # (just a convention)
            |
            |
        has child
            |
            |
            v
        / ----- \                / ---------------- \
        |       | -- has name -> | payload(one.txt) |
        |       |                \ ---------------- /
        | entry |
        |       |                / -------------------- \
        |       | -- has body -> | payload(Hello, One!) |
        \ ----- /                \ -------------------- /
            |
            |
       has sibling
            |
            |
            v
{ similar stuff for two.txt }

There are three basic types of objects:

  • header (represents the entrypoint - links to the root entry, aka root directory),
  • entry (represents either a file or a directory),
  • payload (represents a string (e.g. entry name) or binary data (e.g. file content)).

Each object is assigned a unique identifier (starting from zero and going up) and each object links to other objects using their identifiers as well - this can be observed through the ./ofs inspect command:

[0] = Header(HeaderObj { root: ObjectId(29), clone: None, dead: None })
[1] = Entry(EntryObj { name: ObjectId(2), body: None, next: None, kind: Directory, size: 0, mode: 511, uid: 502, gid: 20 })
[2] = Payload(Payload { size: 1, next: None, data: "/" })
[3] = Payload(Payload { size: 7, next: None, data: "one.txt" })
/* ... */

Now, all this keeping track of the objects, their edges and so on is a huge amount of work, so it's reasonable to ask: what do we gain?

Well, we can become copy on write!

That is, when you modify a file (change its attributes, content, rename it etc.), its original object remains intact - rather, OstFS duplicates that object (with the duplicate having a new identifier), then duplicates its parent, grandparent etc., up until and including the root.

But we don't duplicate everything - if only the file name has changed, there's no point in duplicating payload containing the file's body, so if we did:

$ mv one.txt one.md

... the graph would say:

 /* ... */

        / ----- \                / ---------------- \
        |       | -- has name -> | payload(one.txt) |
        |       |                \ ---------------- /
        | entry |
        |       |                / -------------------- \
        |       | -- has body -> | payload(Hello, One!) |
        \ ----- /                \ -------------------- /
                                                      |
        / ----- \                / --------------- \  |
        |       | -- has name -> | payload(one.md) |  |
        |       |                \ --------------- /  |
        | entry |                                     |
        |       |                                     |
        |       | -- has body ----------------------- /
        \ ----- /

 /* ... */

- payload(one.txt) has one parent
- payload(one.md) has one parent
- payload(Hello, One!) has two parents

Similarly, given a structure like:

     a
   /   \
  b     c
 / \   / \
d   e f   g

... if we modified e, we'd only need to duplicate b and a - the rest could be linked as-is.

This is what makes (almost) zero-cost snapshots (almost) zero-cost - because we don't modify objects in-place, we can reuse this fact to time-travel back to the past, if only we can get our hands on the past object ids (and if we don't remove those past objects, of course).

What's more, this also allows for perfectly safe atomic updates! -- remember the header object?

Header is always located at the object slot 0 (i.e. the beginning of the file) and the most important information it contains is the reference (object id) of the root directory. Initially the root directory starts at slot 1 (right after the header), but as soon as the filesystem gets modified, we generate a new root directory (with a brand new object id), which requires updating the header.

Here's the second greatest part of using copy on writes:

If the power goes down when we're building the new tree, nothing gets accidentally removed/updated! (that is, it's not possible to observe a partial update)

See, since we update the header at the very end of the process (after we've built the entire tree), the only two possible options are:

  • power went down before we've manged to write the header, in which case upon the next mount the filesystem will effectively rollback to its previous state (because we didn't overwrite the older objects),

  • power went down after we've managed to write the header, in which case we're 100% sure the entire new tree is in place, because we update the header last.

(well, it's technically also possible for the power to go down during the header update and that's why ZFS uses the concept of uberblock, but let's not go that deep!)

If you're into that, you might find this interesting:

https://www.youtube.com/watch?v=NRoUC9P1PmA

Limitations

  • OstFS uses 32-byte objects, which makes it easy to understand, but also impractical (doesn't match any typical sector size)
  • OstFS uses linked lists instead of b-treemaps & hashmaps, so a directory with 10k entries will open noticeably slower than a directory with 10 entries
  • OstFS keeps inodes entirely in memory
  • OstFS doesn't store checksums
  • OstFS doesn't store atime, ctime and a couple of other properties
  • OstFS has no caching layer
  • OstFS is a toy - I made it solely to learn a few cool concepts; if you're looking for actual filesystem, ZFS is your friend!

(most - if not all - of this things could get improved -- it's just that I just wanted to hack something over three days, not thirty years)

Hacking

Code is a bit crude, but if you'd like to take a look, here's a couple of useful entrypoints:

  • object.rs
  • objects.rs
  • ops.rs
  • filesystem.rs (in particular the modules inside of it)

License

MIT License

Copyright (c) 2024 Patryk Wychowaniec