
Backup and archive of GitHub repositories.

Primary language: JavaScript. License: GNU Affero General Public License v3.0 (AGPL-3.0).

gitbackup


We back up and archive GitHub.

GitBackup was built at Storj by @super3, @montyanderson, and @calebcase.

Design

We have a single central server exposing a REST API used by both the user interface and by workers.

Workers operate statelessly and can be scaled, limited only by the central server's ability to provision work.

Storj serves as our durable store for all data and metadata. Redis serves as the store for ephemeral data and for data cached for speed.

  • Locks and last sync time in Redis (per username)
  • Everything else in Storj (usernames, repos, last sync, repo count, etc)

Storj

The durable store needs to support the following operations:

  • Listing usernames
  • Getting the last sync time for a username
  • Getting the repository count for a username
  • Listing a user's repositories
  • Getting the last update time for a repository
  • Getting the last error for a repository

To avoid directories with a very large number of entries, paths are constructed with a hash prefix.

The general layout scheme:

bucket        sha256sum(username)[:8]   username   repository archive
github.com/   2b/cb/c2/d5/              octocat/   Hello-World.bundle
github.com/   2b/cb/c2/d5/              octocat/   Hello-World.zip
github.com/   2b/cb/c2/d5/              octocat/   Hello-World.error

For example, the ZIP archive of https://github.com/octocat/Hello-World would be located at: github.com/2b/cb/c2/d5/octocat/Hello-World.zip
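
The layout above can be sketched in JavaScript as follows. This is a minimal illustration, not the gitbackup implementation: it assumes the hash is the hex SHA-256 digest of the username exactly as given (UTF-8, no normalization), and the function names `hashPrefix` and `archivePath` are hypothetical.

```javascript
const { createHash } = require('node:crypto');

// First 8 hex characters of sha256(username), split into 2-character
// directories ("aabbccdd" becomes "aa/bb/cc/dd/").
function hashPrefix(username) {
  const hex = createHash('sha256').update(username, 'utf8').digest('hex');
  return hex.slice(0, 8).match(/.{2}/g).join('/') + '/';
}

// Bucket name plus object path for one archive of one repository.
// `extension` is assumed to be one of "bundle", "zip", or "error".
function archivePath(username, repository, extension) {
  return `github.com/${hashPrefix(username)}${username}/${repository}.${extension}`;
}

console.log(archivePath('octocat', 'Hello-World', 'zip'));
```

Because the prefix depends only on the username, every archive for a given user lands under the same directory, which keeps listing a user's repositories a single-directory operation.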

Sharding

The data will be sharded across all production satellites to maximize our total throughput and available storage. Sharding is done per user: the first byte of the sha256sum of the username selects the satellite, with the 256 possible byte values split as evenly as possible among the satellites.

Sharding allocations with our current satellites:

satellite       min   max
asia-east-1     00    55
europe-west-1   56    aa
us-central-1    ab    ff
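
As a rough sketch of how the allocation table above could be applied, the following JavaScript maps a username to a satellite. The ranges are copied from the table; the names `SHARDS`, `satelliteForByte`, and `satelliteForUser` are hypothetical, and the username is assumed to be hashed as given, without normalization.

```javascript
const { createHash } = require('node:crypto');

// Inclusive ranges of the first hash byte, copied from the table above.
const SHARDS = [
  { satellite: 'asia-east-1', min: 0x00, max: 0x55 },
  { satellite: 'europe-west-1', min: 0x56, max: 0xaa },
  { satellite: 'us-central-1', min: 0xab, max: 0xff },
];

// Map a first-byte value (0..255) to the satellite that owns it.
function satelliteForByte(byte) {
  const shard = SHARDS.find((s) => byte >= s.min && byte <= s.max);
  if (!shard) throw new RangeError(`byte out of range: ${byte}`);
  return shard.satellite;
}

// Hash the username and look up the satellite for its first byte.
function satelliteForUser(username) {
  const firstByte = createHash('sha256').update(username, 'utf8').digest()[0];
  return satelliteForByte(firstByte);
}

console.log(satelliteForByte(0x2b)); // asia-east-1
```

A hash prefix beginning with 2b falls in the 00 to 55 range, which is why the example paths in this document use the asia-east-1 remote.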

Listing a user's repositories

rclone ls 'asia-east-1:github.com/2b/cb/c2/d5/octocat/'

Getting the last update time for a repository

rclone lsl 'asia-east-1:github.com/2b/cb/c2/d5/octocat/Hello-World.bundle'

Getting the last error for a repository

rclone cat 'asia-east-1:github.com/2b/cb/c2/d5/octocat/Hello-World.error'

Redis

Locks

Locks are stored as normal Redis keys with a TTL as described by Redlock. The lock must be refreshed by the worker before it expires. For example, if locks expire every 10 seconds, the worker should attempt to relock after 5 seconds.

Initially getting the lock:

SET "lock:octocat" 1 EX 10 NX

Relocking:

EXPIRE "lock:octocat" 10

Locks are not explicitly deleted and are left to expire.
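
The lock commands above could be wrapped in a worker roughly as follows. This is a sketch under assumptions: `redis` is taken to be an ioredis-style client whose methods mirror the Redis commands (`set` resolving to `'OK'` or `null`, `expire` resolving to `1` or `0`), and the names `acquireLock`, `refreshLock`, and `keepLock` are hypothetical.

```javascript
const LOCK_TTL_SECONDS = 10;

// Try to take the lock for a username. Resolves to true if we now hold it.
// `redis` is assumed to be an ioredis-style client.
async function acquireLock(redis, username) {
  const reply = await redis.set(`lock:${username}`, 1, 'EX', LOCK_TTL_SECONDS, 'NX');
  return reply === 'OK';
}

// Push the expiry out again. Resolves to false if the key has already
// expired, in which case the worker can no longer assume it holds the lock.
async function refreshLock(redis, username) {
  const reply = await redis.expire(`lock:${username}`, LOCK_TTL_SECONDS);
  return reply === 1;
}

// Refresh at half the TTL. Calls `onLost` and stops if a refresh fails or
// finds the lock gone. Returns a function that stops the refresh timer.
function keepLock(redis, username, onLost) {
  const timer = setInterval(async () => {
    try {
      if (!(await refreshLock(redis, username))) {
        clearInterval(timer);
        onLost();
      }
    } catch (err) {
      clearInterval(timer);
      onLost(err);
    }
  }, (LOCK_TTL_SECONDS * 1000) / 2);
  return () => clearInterval(timer);
}
```

A worker would call `acquireLock`, start `keepLock` while it syncs, and call the returned stop function when done, leaving the key to expire on its own. Since the stored value is the constant 1, a refresh cannot tell whether another worker has taken the lock after an expiry; that limitation follows from the commands as written.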

Last Sync

Last sync data is stored in Redis to facilitate fast calculation of which user should be synced next. This data is rebuilt from the Storj bucket metadata on startup.

Initially each user is added to the tracked sorted set:

ZADD tracked 0 "github.com/octocat"

Here the score (0 in this example) is the last time the user was fully synced, or -1 if the user has never been synced.

Getting the next user to sync is accomplished by retrieving users sorted by score (and then skipping any that are locked):

ZRANGEBYSCORE tracked "-inf" "+inf" LIMIT 0 1
SET "lock:octocat" 1 EX 10 NX
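
One way the two commands above might be combined into a selection loop is sketched below. It assumes an ioredis-style client whose `zrangebyscore` and `set` methods take the same arguments as the Redis commands, and it assumes that a tracked member such as "github.com/octocat" is locked under "lock:octocat". The function name `nextUser` is hypothetical.

```javascript
// Return the least recently synced user whose lock we managed to take,
// or null if every tracked user is currently locked.
async function nextUser(redis, lockTtlSeconds = 10) {
  for (let offset = 0; ; offset += 1) {
    // Fetch one candidate at a time, lowest score (oldest sync) first.
    const [member] = await redis.zrangebyscore(
      'tracked', '-inf', '+inf', 'LIMIT', offset, 1
    );
    if (member === undefined) return null; // ran out of tracked users

    // Assumption: "github.com/octocat" is locked under "lock:octocat".
    const username = member.slice(member.lastIndexOf('/') + 1);
    const reply = await redis.set(`lock:${username}`, 1, 'EX', lockTtlSeconds, 'NX');
    if (reply === 'OK') return username;
    // Lock is held by another worker: skip to the next candidate.
  }
}
```

Fetching one candidate per round trip keeps the sketch close to the commands shown; a real worker pool would likely fetch a small batch per call to reduce round trips when many users are locked.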