Research project of go-faster.
Content based on www.gharchive.org used under the CC-BY-4.0 license.
Utilities for working with GH Archive, a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
[
{
"input": "1693 GB",
"content": "13 TB",
"output": "1191 GB"
}
]
[
{
"state": "NotFound",
"count": 319
},
{
"state": "Ready",
"count": 68952
}
]
319 chunks are missing (68952 are ready, 69271 in total). It is unclear whether they can be restored, but this is not critical.
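The share of missing chunks follows directly from the two counts in the state report. A minimal sketch, assuming the report is available as the JSON array shown above; the `missing_share` helper is hypothetical and not part of this project:

```python
import json

# State report copied from the listing above.
REPORT = """
[
  {"state": "NotFound", "count": 319},
  {"state": "Ready", "count": 68952}
]
"""


def missing_share(report_json: str) -> tuple[int, int, float]:
    """Return (missing, total, missing / total) for a chunk state report."""
    counts = {row["state"]: row["count"] for row in json.loads(report_json)}
    total = sum(counts.values())
    missing = counts.get("NotFound", 0)
    # Guard against an empty report to avoid dividing by zero.
    share = missing / total if total else 0.0
    return missing, total, share


if __name__ == "__main__":
    missing, total, share = missing_share(REPORT)
    print(f"{missing} of {total} chunks missing ({share:.2%})")
```

The total is the sum over all states, so any additional state that shows up in a later report is counted automatically.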
Language data is not included in events.
There is an incomplete public dataset (only about 3 million repos):
SELECT * FROM `bigquery-public-data.github_repos.languages`;
However, many popular repositories are missing and manual data retrieval is required.
The dataset lists programming languages by repository, as reported by GitHub's https://developer.github.com/v3/repos/#list-languages API:
- No repo ID, only the repository name
- Probably no removed or renamed repos
- ~3 million entries
- Language data is an array of (language name, bytes) entries
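The row shape described above, and the manual retrieval needed for repositories missing from the dataset, can be sketched as follows. This is an illustration under stated assumptions, not this project's implementation: the helper names and the sample row are hypothetical, the row is assumed to carry `repo_name` plus a `language` array of `name`/`bytes` records, and `fetch_languages` assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/languages`, which returns a JSON object mapping language name to bytes.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen


def from_dataset_row(row: dict) -> dict[str, int]:
    """Flatten one dataset row into {language name: bytes}."""
    # int() covers exports where integer columns are serialized as strings.
    return {entry["name"]: int(entry["bytes"]) for entry in row["language"]}


def primary_language(langs: dict[str, int]) -> Optional[str]:
    """Return the language with the most bytes, or None if there is no data."""
    if not langs:
        return None
    return max(langs, key=langs.get)


def fetch_languages(full_name: str, token: Optional[str] = None) -> dict[str, int]:
    """Fetch languages for "owner/repo" from the GitHub API (needs network)."""
    req = Request(
        f"https://api.github.com/repos/{full_name}/languages",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        # A token is optional but raises the API rate limit.
        req.add_header("Authorization", f"Bearer {token}")
    with urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical row, for illustration only.
    row = {
        "repo_name": "example/repo",
        "language": [
            {"name": "Go", "bytes": 1200},
            {"name": "Shell", "bytes": 300},
        ],
    }
    langs = from_dataset_row(row)
    print(row["repo_name"], primary_language(langs), langs)
```

Both sources end up in the same `{name: bytes}` shape, so dataset rows and API responses can be merged by repository name. Because the dataset has no repo ID, a renamed repository will not match on name and has to be handled separately.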