A simple backup script to archive your GitHub repositories to AWS S3. I'm a proponent of personal data ownership, but honestly feel like a large number of online services are more hassle than they're worth to self-host. I have either made most of my personal git repositories public or plan to do so, so there aren't many privacy concerns with keeping them in "the cloud". That said, I don't want to rely solely on a third party to safeguard my data, and I'd like to minimize vendor lock-in as much as possible.
This project is meant to automate the process of generating backups of all of a user's GitHub repositories in a vendor-neutral format (git bundle).
Those files are then uploaded to S3 (yes yes, I know... avoiding cloud lock-in by using another cloud service).
The basic process goes something like:
- Scan all repositories on GitHub
- Pull all branches and their current SHA hashes, then combine them to create a repository version fingerprint
- Scan all bundles currently in the bucket and pull the last recorded version fingerprint from their metadata
- Calculate which repositories need to be backed up, then for each one:
  - Clone the repository into a temp directory
  - Create a git bundle
  - Upload the bundle
This tool is stateless except for the configuration and the backups themselves.
GITHUB_USERNAME=
GITHUB_ACCESS_TOKEN=
AWS_REGION=
AWS_BUCKET=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
HEALTHCHECK_PING_URL=
Enter your GitHub username. Generate a personal access token with the "repo" scope at https://github.com/settings/tokens/new?scopes=repo and copy that value into the environment file as well.
Create an access key for an IAM user with sufficient S3 permissions.
You can probably narrow it down further, but I just use the AmazonS3FullAccess policy.
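If you do want to narrow it down, a policy along these lines should cover what the tool needs — listing the bucket, reading object metadata, and uploading bundles. This is an untested sketch (not verified against the code), and the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    }
  ]
}
```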
Create a bucket to store the repositories in. It should probably be private. If object versioning is enabled, you'll probably also want some process to clear out old versions.
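One option for clearing old versions is an S3 lifecycle rule that expires noncurrent versions, applied with aws s3api put-bucket-lifecycle-configuration. A sketch (the 30-day window is an arbitrary example):

```json
{
  "Rules": [
    {
      "ID": "expire-old-bundle-versions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    }
  ]
}
```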
The access key id, access key, and bucket name all need to be placed in the environment file.
There is also an (optional) integration with Healthchecks.io that might work with other monitoring systems as well.
If you add an environment value for HEALTHCHECK_PING_URL, it will perform a simple GET request to that URL at the end of every successful run.
- Clone this project locally
- Copy .env.template to .env and fill in the configuration options
- Ensure that you have the following dependencies installed: node 15, git
- Complete the configuration steps above
- Run yarn install to get dependencies from npm
- Run yarn backup-local to trigger the script
Maybe think about putting the last step in a cron job if you want it to happen automatically.
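For example, a crontab entry like this (path is hypothetical) would run the backup nightly at 2 AM:

```
0 2 * * * cd /home/you/github-s3-backup && yarn backup-local >> backup.log 2>&1
```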
- Ensure that you have docker installed
- Create a local .env file based on the template in this repository
- Run docker run --env-file .env --rm isaacsimmons/github-s3-backup
Maybe think about putting the last step in a cron job if you want it to happen automatically. Depending on your permissions, you may need to run the docker command with sudo or as root.
For example, to run it nightly at 2 AM:
0 2 * * * docker run --env-file /home/pi/github-s3-backup/.env --rm isaacsimmons/github-s3-backup:arm
(Really just a note for myself about publishing)
TODO: README on DockerHub
docker login
docker build . -t isaacsimmons/github-s3-backup
docker build . -f Dockerfile.arm -t isaacsimmons/github-s3-backup:arm
docker push isaacsimmons/github-s3-backup
docker push isaacsimmons/github-s3-backup:arm
But also, just set up automated builds on Docker Hub.
This only works with GitHub currently, but could probably be extended to scrape from other git hosting providers without too much effort. Likewise, it has only been configured to back up the files to Amazon S3, but again it wouldn't be difficult to extend with additional storage backends.
Only the commits themselves are backed up. If you care about Pull Requests, Issues, Wikis, etc., then support for those would need to be added.
The fetching of object metadata from S3 and of branch information from GitHub are both unbatched, issuing one request per repository. If you have a very large number of repositories, this may incur a significant network or time cost.
Repositories are only backed up if they have changed since the last time this tool was run. When there are changes, however, the backup process isn't incremental: the whole repository is cloned and bundled again. For particularly large git repositories, this may be undesirably slow.