bk

Command-line backup system

Primary language: Go. License: BSD 2-Clause "Simplified" (BSD-2-Clause).

Overview

bk is a tool for backing things up--both raw data streams and directory hierarchies. I wrote it because I wanted to have personal responsibility for my data's integrity, up to and including being responsible for data loss due to bugs in the backup system. You should probably use something else to back up your data--bup is a great choice.

That said, thanks to Google for letting me open source it.

Features

My goal was to implement the absolute minimum number of features necessary for my needs; the idea was that a minimal feature set (and in turn, a minimal number of lines of code) would reduce the probability of bugs (and in turn, the probability of data corruption).

  • Data de-duplication (using a rolling hash).
  • Compression (gzip).
  • Optional encryption (using Go's AES implementation).
  • Data integrity (and corruption recovery) using Reed-Solomon encoding.
  • Direct backups to cloud storage.
  • Ability to access backups via FUSE.
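
The de-duplication feature listed above rests on content-defined chunking: a rolling hash over a small sliding window decides where each chunk ends, so inserting bytes near the start of a stream shifts only the nearby boundary and later chunks hash to the same values as before. The following is a minimal sketch of that idea, not bk's code (bk's rolling hash comes from bup). The additive checksum, the window size, the split mask, and all function names are assumptions chosen for brevity.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	windowSize = 64        // bytes in the rolling window (assumed value)
	splitMask  = 1<<11 - 1 // boundary when the low 11 bits are all ones (assumed value)
)

// splitChunks cuts data wherever the rolling sum of the last windowSize
// bytes matches splitMask, so each boundary depends only on nearby content.
func splitChunks(data []byte) [][]byte {
	var chunks [][]byte
	var sum uint32
	start := 0
	for i, b := range data {
		sum += uint32(b) // byte entering the window
		if i >= windowSize {
			sum -= uint32(data[i-windowSize]) // byte leaving the window
		}
		if i+1 >= windowSize && sum&splitMask == splitMask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing partial chunk
	}
	return chunks
}

// chunkSet returns the SHA-256 of every chunk; equal chunks share one entry,
// which is what lets a store keep a single copy.
func chunkSet(chunks [][]byte) map[[32]byte]bool {
	set := make(map[[32]byte]bool)
	for _, c := range chunks {
		set[sha256.Sum256(c)] = true
	}
	return set
}

// pseudoRandom fills n bytes from a fixed-seed LCG so runs are reproducible.
func pseudoRandom(n int) []byte {
	data := make([]byte, n)
	x := uint32(1)
	for i := range data {
		x = x*1664525 + 1013904223
		data[i] = byte(x >> 24)
	}
	return data
}

func main() {
	original := pseudoRandom(1 << 16)
	edited := append([]byte("a few inserted bytes"), original...)

	stored := chunkSet(splitChunks(original))
	editedChunks := splitChunks(edited)
	shared := 0
	for _, c := range editedChunks {
		if stored[sha256.Sum256(c)] {
			shared++
		}
	}
	fmt.Printf("%d of %d chunks in the edited stream were already stored\n",
		shared, len(editedChunks))
}
```

A plain additive sum is a weak rolling hash (constant input either never splits or splits at every byte), so a real system would use something stronger and would likely enforce minimum and maximum chunk sizes.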

Usage

Set your BK_DIR environment variable either to a local directory or to a Google Cloud Storage bucket name of the form "gs://somebucketname".

To set up a backup repository, run:

% bk init

The target directory must already exist and be empty. To back up to Google Cloud Storage:

% env BK_GCS_PROJECT_ID=myproject-1234 bk init 

For an encrypted repository:

% env BK_PASSPHRASE=yolo bk init --encrypt

Though don't do it like that, since you don't want your passphrase in your shell command history.

To back up a directory hierarchy (e.g., your home directory):

% bk backup home ~

(BK_PASSPHRASE must be set if the repository is encrypted.) Here, the backup is named "home". bk adds the current date and time to the name of the backup; all available backups can be listed with "bk list".

Backups can be referred to via their full name and time as provided by "bk list"--e.g. "home@20170413104506". If just the base backup name is given ("home"), the most recent backup with that base name is used.
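
The naming rule above can be sketched in a few lines of Go. This is only an illustration, not bk's code: the helper names backupName and resolve are hypothetical, and the sketch assumes that base names never contain "@" and that the suffix is always the 14-digit yyyymmddhhmmss form shown in the example. Because the timestamp is fixed-width, plain string comparison orders backups chronologically.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// stampLayout is Go's reference-time layout for the yyyymmddhhmmss suffix
// seen in names such as "home@20170413104506".
const stampLayout = "20060102150405"

// backupName appends a timestamp to a base name (hypothetical helper).
func backupName(base string, t time.Time) string {
	return base + "@" + t.Format(stampLayout)
}

// resolve maps a user-supplied reference to a full backup name.
// A reference containing "@" must match exactly; a bare base name
// resolves to the most recent backup with that base.
func resolve(ref string, all []string) (string, bool) {
	if strings.Contains(ref, "@") {
		for _, n := range all {
			if n == ref {
				return n, true
			}
		}
		return "", false
	}
	best := ""
	for _, n := range all {
		// Fixed-width timestamps sort lexicographically in time order.
		if strings.HasPrefix(n, ref+"@") && n > best {
			best = n
		}
	}
	return best, best != ""
}

func main() {
	all := []string{
		backupName("home", time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC)),
		backupName("home", time.Date(2017, 4, 13, 10, 45, 6, 0, time.UTC)),
		backupName("etc", time.Date(2016, 6, 1, 12, 0, 0, 0, time.UTC)),
	}
	name, ok := resolve("home", all)
	fmt.Println(name, ok)
}
```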

Incremental backups can be performed using the --base argument; the following uses the most recent backup from the set named "home" as the baseline.

% bk backup --base home home ~

Note that incremental backups only make backups run faster (by not scanning the contents of every file); there is no space benefit, since bk applies low-level deduplication to the data it stores.

To restore from a backup:

% bk restore home /tmp/restored

To mount all backups as a FUSE directory (if you have FUSE installed):

% bk mount /mnt

The resulting hierarchy has the structure "backup_name/year/month/day/hhmmss".

Run "bk help" for more information and additional commands.

Influences

  • Venti: A New Approach to Archival Storage, Sean Quinlan and Sean Dorward. Hash-based archival storage, from the Plan 9 project.
  • A Low-bandwidth Network File System, Athicha Muthitacharoen, Benjie Chen, and David Mazières. Rolling hashes to break up bitstreams.
  • bup: rolling hashes, hash-based archival storage, all wrapped up in git packfiles. bk's rolling hash code comes from bup.
  • Foundation: hash-based archival storage that revisits some of Venti's design decisions; it showed that rolling hashes (versus block-based archiving) weren't a big win.

In general, bup and Foundation both go to some effort to provide efficient access to hash-addressed data without loading into memory an entire index that maps hashes to storage locations. For my use of bk, the indices are a few hundred MB, so they're all just loaded at startup time. Note that this isn't an ideal approach when using cloud storage; something along the lines of Foundation's approach (or keeping a local cache of the index) would probably be better.
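
To make the "load the whole index at startup" approach concrete, here is a minimal sketch of an in-memory map from 32-byte hashes to storage locations. Everything about the layout is an assumption made for illustration: the record format, the field names, and the helpers encodeRecord and parseIndex are hypothetical and do not describe bk's actual index files.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Hash is a 32-byte content hash used as the lookup key.
type Hash [32]byte

// Location says where a blob lives (hypothetical fields).
type Location struct {
	Pack   uint32 // which pack file holds the blob
	Offset uint64 // byte offset within that pack
	Length uint32 // blob length in bytes
}

// record is an assumed fixed-size serialized index entry (48 bytes).
type record struct {
	Hash Hash
	Loc  Location
}

// encodeRecord serializes one entry in little-endian order.
func encodeRecord(h Hash, loc Location) []byte {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, record{h, loc}); err != nil {
		panic(err) // only possible for unsupported types
	}
	return buf.Bytes()
}

// parseIndex reads every record into a map, failing on truncated input.
func parseIndex(data []byte) (map[Hash]Location, error) {
	idx := make(map[Hash]Location)
	r := bytes.NewReader(data)
	for {
		var rec record
		err := binary.Read(r, binary.LittleEndian, &rec)
		if errors.Is(err, io.EOF) {
			return idx, nil // clean end of input
		}
		if err != nil {
			return nil, err // e.g. io.ErrUnexpectedEOF on a partial record
		}
		idx[rec.Hash] = rec.Loc
	}
}

func main() {
	var h Hash
	h[0] = 0xab
	data := encodeRecord(h, Location{Pack: 1, Offset: 4096, Length: 512})

	idx, err := parseIndex(data)
	if err != nil {
		panic(err)
	}
	loc, found := idx[h]
	fmt.Println(loc, found)
}
```

Lookups are then a single map access, at the cost of holding every entry in memory and of reading the full index before the first lookup, which is the drawback noted above for cloud storage.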

FAQs that no one has asked

Q: Wouldn't it be easier to just buy a Time Capsule?

A: Enjoy your "sparse bundle in use" errors that leave all of your backups corrupt and irrecoverable but aren't reported until you try to restore.

Q: Isn't most of this functionality provided by upspin?

A: It looks like it, especially as they implement the rest of the infrastructure for some of their key use cases.

Q: Why did you invent your own packfile format rather than using git's?

A: bk's pack files are simpler than git's, but they lack many of git's advantages, such as efficient lookups after just a few seeks in index files without reading the whole index into memory. On the other hand, bk uses SHAKE256 to hash data blobs into 32 bytes of hash; git's choice of SHA-1 now looks somewhat unfortunate, though for personal backups this probably isn't something to worry much about.

Q: Why not use bup?

A: You should use bup. It has lots of users, which makes it less likely to have subtle bugs. I wrote bk for fun (Go is fun) and because I wanted to own responsibility for my bits. Also, bup doesn't directly support encryption or uploading directly to GCS.

Q: Your use of Check() and CheckError() isn't idiomatic Go error handling.

A: That's not a question. For a backup system, I believe that most errors should cause the system to immediately stop and fail obviously rather than make an attempt to recover (since the recovery code paths won't be well exercised and are thus likely to be buggy). Given this decision, I'd prefer that those checks take a single line of code instead of the three lines needed to test the error against nil and then panic.
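
For readers unfamiliar with the pattern, here is a minimal sketch of what fail-fast helpers of this kind might look like. The names Check and CheckError come from the question above, but the signatures and bodies are assumptions; bk's actual helpers may differ. The panics helper and errExample exist only to demonstrate the behavior.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errExample is a stand-in failure used only for demonstration.
var errExample = errors.New("example failure")

// CheckError panics if err is non-nil (assumed signature).
func CheckError(err error) {
	if err != nil {
		panic(err)
	}
}

// Check panics if the condition is false (assumed signature).
func Check(ok bool) {
	if !ok {
		panic("check failed")
	}
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	// Three-line form: test against nil, then panic.
	f, err := os.Open(os.DevNull)
	if err != nil {
		panic(err)
	}
	CheckError(f.Close())

	// One-line form with the helper.
	f, err = os.Open(os.DevNull)
	CheckError(err)
	CheckError(f.Close())

	fmt.Println("CheckError(errExample) panics:",
		panics(func() { CheckError(errExample) }))
}
```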