python/core-workflow

Backup GitHub information

brettcannon opened this issue · 24 comments

Better safe than sorry.

Versioned content history

Better safe than sorry.

Certainly!

Additionally, GitHub issues are editable and deletable. Accordingly, there is potential for abuse (e.g. deceptive revisions) that a diff-tracking backup system would help prevent. On the other hand, in the case of content that needs to be purged (e.g. copyrighted, malicious, or inappropriate material), persistence of old versions could be problematic.

I recently created a backup solution for a text corpus (dhimmel/thinklytics) that errs on the side of versioning. To back up the content and track history, that repo uses scheduled Travis CI builds to download and process the content. If successful, the CI job commits the changes back to GitHub. I'm not sure whether this method would scale to the activity level of the Python repositories, especially if you'd like to back up all content, including uploads/images attached to comments. So, just a thought.
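For a sense of the shape of that approach, here is a minimal sketch of the kind of script a scheduled CI job could run: fetch issues via the REST API, write them out as JSON, and commit only when something changed. The repo name, token handling, and file layout are illustrative assumptions, not the actual dhimmel/thinklytics pipeline.

```python
# Minimal sketch of a scheduled-CI backup step: download issues as JSON and
# commit any changes back to the backup repository. Repo name, token handling,
# and paths are illustrative assumptions.
import json
import os
import pathlib
import subprocess

import requests

TOKEN = os.environ["GITHUB_TOKEN"]          # assumed to be provided by the CI job
REPO = "python/core-workflow"               # example target repository
URL = f"https://api.github.com/repos/{REPO}/issues"

session = requests.Session()
session.headers["Authorization"] = f"token {TOKEN}"

issues, url, params = [], URL, {"state": "all", "per_page": 100}
while url:
    response = session.get(url, params=params)
    response.raise_for_status()
    issues.extend(response.json())
    # Follow the Link header for pagination; later pages carry their own params.
    url, params = response.links.get("next", {}).get("url"), None

out = pathlib.Path("backup/issues.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(issues, indent=2, sort_keys=True))

# Commit and push only if the snapshot actually changed.
subprocess.run(["git", "add", str(out)], check=True)
if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
    subprocess.run(["git", "commit", "-m", "Update issue backup"], check=True)
    subprocess.run(["git", "push"], check=True)
```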

There are several options according to: https://help.github.com/articles/backing-up-a-repository/

I haven't tried the new Migrations API, but I've tried one of the backup mechanisms mentioned in https://help.github.com/articles/backing-up-a-repository/: GitHub Records Archiver.

I used my personal access token to run the script. It was able to back up these repos for me within the python organization before it ran into an API rate-limit issue 😛

But for each of the projects that it did back up:

  • it downloads the issues and PRs in both .md and .json formats
  • it makes a git clone of the repo, retaining its git history

It was able to back up these projects before I used up all my available API calls:

community-starter-kit		psf-ca				pypi-cdn-log-archiver
docsbuild-scripts		psf-chef			pypi-salt
getpython3.com			psf-docs			pythondotorg
historic-python-materials	psf-fastly			raspberryio
mypy				psf-salt			teams
peps				psfoutreach
planet				pycon-code-of-conduct
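For reference, how much of the rate-limit budget is left can be checked up front with the standard /rate_limit endpoint before a run; a small sketch (the token environment variable is an assumption):

```python
# Check how many GitHub API calls remain before kicking off an archiver run.
import os

import requests

response = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
)
response.raise_for_status()
core = response.json()["resources"]["core"]
print(f"{core['remaining']} of {core['limit']} calls left; resets at {core['reset']}")
```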

OK, I just read this about the Migrations API:

The Migrations API is only available to authenticated organization owners. The organization migrations API lets you move a repository from GitHub to GitHub Enterprise.

This is as far as I can go since I'm not a Python organization owner :)

@Mariatta I would ask @ewdurbin if you can maybe become an org owner to continue looking into this (or bug him to do it 😉).

I kicked off an archive for python/cpython just to see what it produces. Once it finishes I'll summarize the contents here and we can discuss if it fits our needs.
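For reference, kicking off such an export is a single authenticated POST to the Migrations API. A rough sketch based on the documented endpoint, not the actual tooling used here (the API was in preview at the time, hence the custom Accept header; the repo list is just an example):

```python
# Sketch: start an organization migration (export) for one repository.
# Requires an org-owner token; the preview media type reflects the docs of the time.
import os

import requests

response = requests.post(
    "https://api.github.com/orgs/python/migrations",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.wyandotte-preview+json",
    },
    json={"repositories": ["python/cpython"], "lock_repositories": False},
)
response.raise_for_status()
migration = response.json()
print(migration["id"], migration["state"])  # e.g. 12345 "pending"
```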

Thanks @brettcannon and @ewdurbin :)

I archived my own project (black_out), and the output is [link expired]

It's not as huge as CPython, so I figured it might be easier to analyze.

Never mind that link above; it timed out 😅 Here is the downloaded content:
f8244650-6b4c-11e8-8b72-4d7fe688e0a1.tar.gz

The result of the Migrations API dump appears to have everything and is well organized.

Since the dump is intended for migrating from GitHub to GitHub Enterprise and is an official GitHub offering (although currently in preview), it seems to be the solution least likely to require regular maintenance beyond ensuring it runs and that we collect and store the tarball safely.

Summary of what's there; on a cursory glance, these generally line up with GitHub API objects in JSON format:

schema.json: contains a version specifier for what I assume is the dump version, and a github_sha for what I assume is the version of the GitHub codebase that ran the dump.

repositories_NNNNNN.json: metadata about the repository's GitHub configuration, including the enabled settings (has_issues, has_wiki, has_downloads) as well as the labels, collaborators, and webhooks configured.

repositories: directory containing the actual git repos!

protected_branches.json: the configuration for branches that have specific requirements for merging. This includes review requirements, status checks, and enforcement settings.

users_NNNNNN.json and organizations_NNNNNN.json: metadata about all GitHub Users and associated Organizations that have interacted with the repository via commit, PR, PR review, or comment.

teams_NNNNNN.json: contains the various teams defined in the organization and their permissions on various repositories.

Beyond that we get into the primitives that comprise what we see as a "Pull Request" or "Issue"; again, these appear to line up 1:1 with JSON objects from the GitHub API.

attachments and attachments_NNNNNN.json

pull_requests_NNNNNN.json and issues_NNNNNN.json

pull_request_reviews_NNNNNN.json

commit_comments_NNNNNN.json, issue_comments_NNNNNN.json, pull_request_review_comments_NNNNNN.json
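For anyone who wants to poke at a dump like this, here is a quick sketch for inspecting it without fully extracting it; the printed fields are guesses based on the regular GitHub API issue objects, not confirmed contents of the dump format.

```python
# Sketch: peek inside a Migrations API archive without fully extracting it.
# Uses the example tarball name from above; printed fields are assumptions.
import json
import tarfile

with tarfile.open("f8244650-6b4c-11e8-8b72-4d7fe688e0a1.tar.gz") as archive:
    names = archive.getnames()
    print("\n".join(sorted(names)[:20]))  # a taste of the layout

    issues_name = next(n for n in names if "issues_" in n and n.endswith(".json"))
    data = json.load(archive.extractfile(issues_name))
    issues = data if isinstance(data, list) else data.get("issues", [])
    for issue in issues:
        print(issue.get("number"), issue.get("title"))
```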

Thanks for the update, @ewdurbin!
I'm a little curious how long the backup takes, but it doesn't matter much. The backup data looks great to me!! 😄

Will you be able to set up daily backups for the python GitHub org? (cpython is higher priority, I would think 😇)
I assume this is something that can be stored within the PSF's infrastructure.

Thanks!

@Mariatta the backup took about 15 minutes to run, but it's asynchronous, so we can just kick one off and then poll for completion before pulling the archive.
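Roughly, that flow could look like the sketch below; the migration id is a placeholder and the state names come from the Migrations API docs, not from the actual task that will run this.

```python
# Sketch of the "kick off, poll, then pull the archive" flow.
# The migration id is a placeholder from an earlier start-migration call.
import os
import time

import requests

HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github.wyandotte-preview+json",
}
migration_id = 12345  # placeholder
base = f"https://api.github.com/orgs/python/migrations/{migration_id}"

while True:
    state = requests.get(base, headers=HEADERS).json()["state"]
    if state == "exported":
        break
    if state == "failed":
        raise RuntimeError("migration failed")
    time.sleep(60)  # the run above took ~15 minutes, so check once a minute

archive = requests.get(f"{base}/archive", headers=HEADERS)  # follows the redirect
archive.raise_for_status()
with open(f"migration-{migration_id}.tar.gz", "wb") as fh:
    fh.write(archive.content)
```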

The result was 320 MB, so I'm curious whether weekly might suffice for now?

If we stick with daily, what kind of retention would we want?

Daily backups for the past week, weekly backups for the past month, monthly backups forever?

Hmm, I don't know what the usual good backup practice is; open to suggestions.
I'm thinking at the very least we really should do daily backups. How long to keep them, I don't know :)

How crazy would it be to stick everything into a git repo that would be hosted on GitHub but also mirrored somewhere else?

That's probably not completely out of the realm of reasonability. The biggest concern there would be attachments and the notorious "git + big/binary files" limitations.

the biggest concern there would be attachments and the notorious "git + big/binary files" limitations.

For large binary files, I would suggest using Git LFS. GitHub supports LFS files up to two gigabytes in size. If your organization qualifies for GitHub Education, you can request a free LFS quota. It is also possible to use a GitHub repository with LFS assets stored on GitLab; however, the interface is less user-friendly that way.

Well, I assume there are limits for the attachments anyway. @ewdurbin, could you check what the biggest file there is?

Limitations on attachments are documented here

That's not that big. I mean of course versioning a 25 MB binary blob will eventually be crazy, but those attachments don't change over time IMHO.

I think Ernest's retention policy suggestion works.

Okay, for the initial pass I'll set up a task to kick off the "migration" and fetch it once complete each day.

I think the archives can just be dropped in an S3 bucket with a little bit of structure and some retention policies to automatically clear out unnecessary archives.
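One way to express the structure-plus-retention idea, assuming archives are keyed under daily/, weekly/, and monthly/ prefixes; the bucket name and retention windows below are made up for illustration, not the actual setup.

```python
# Sketch: S3 lifecycle rules that expire daily and weekly archives while
# keeping monthly ones indefinitely. Bucket name and windows are assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="python-org-github-backups",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {"ID": "daily", "Filter": {"Prefix": "daily/"},
             "Status": "Enabled", "Expiration": {"Days": 7}},
            {"ID": "weekly", "Filter": {"Prefix": "weekly/"},
             "Status": "Enabled", "Expiration": {"Days": 30}},
            # No rule for monthly/ -> those archives are kept indefinitely.
        ]
    },
)
```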

Will post back here with more information.

Never came back and updated this: we ended up using Backhub, which is keeping daily snapshots for the past month and pushing archives to S3 as well.

In that case, this issue can probably be closed! :)

Backup is set up and there have been no comments in the past ~2 years, so closing! 💾💾