Gary Crye - Interest: 8/10
GitHub promotes several community health files that supposedly help grow communities. GitHub has built tight integrations around certain files so that repository owners can customize and enhance the experience of users and potential contributors. An easy example of this is how the `README.md` serves as the repository's landing page. A more complicated example is how a GitHub Actions workflow (`.github/workflows/*.yml`) can run automated testing, increasing confidence in new code contributions. We'd like to measure what (if any) impact these and other features have on unique contributor counts.
This information might help us know what features to prioritize in future projects if we were looking for outside contributions from the open source community.
- Basic repository details
  - README length
  - Number of topics (like hashtags or filters)
  - Seconds since first push to GitHub
  - Seconds since most recent update
  - Number of languages used
  - Most common language used, which might help us control for relative language popularity
- Community Health files
  - `LICENSE` type, probably using the SPDX identifier
    - In theory, no LICENSE makes contributing nearly impossible. Can we somehow bin licenses from more-permissive to less-permissive to see what impact that makes?
  - `CODE_OF_CONDUCT`
  - `CONTRIBUTING`
  - `FUNDING`
  - Number of `ISSUE_TEMPLATE`s
  - Number of `PULL_REQUEST_TEMPLATE`s
  - `SECURITY`
  - `SUPPORT`
- Other GitHub features
  - `CODEOWNERS`
  - Existence of a `CHANGELOG` or usage of Releases
  - Usage of Milestones
  - CodeSpaces enabled
  - Wiki enabled
  - Discussions enabled (`GET https://github.com/{ORG_NAME}/{REPO_NAME}/discussions`)
  - Number of project boards
  - GitHub Actions usage
    - Uses GitHub Actions (`GET https://github.com/{ORG_NAME}/{REPO_NAME}/blob/{DEFAULT_BRANCH_NAME}/.github/workflows`)
    - Number of GitHub Actions workflows
    - Number of GitHub Actions workflow runs
  - Dependabot usage (`GET https://github.com/{ORG_NAME}/{REPO_NAME}/blob/{DEFAULT_BRANCH_NAME}/.github/dependabot.yml`)
  - Uses other GitHub checks integrations, possibly from 3rd party CI tools (probably measured by looking for the existence of a check suite on the most recent commit)
- Some stuff that would be harder to figure out
  - Average seconds until first Issue response
  - Average seconds until first Pull Request response
  - Ratio of pull requests merged to pull requests closed without merging
- Number of code contributors, possibly normalized on a log scale ← Main goal
- Number of forks
- Number of watchers
- Number of stargazers
- Open issues count
- Pull request count
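
Two of the transformations hinted at in the list, binning licenses by permissiveness and putting the contributor count on a log scale, could be sketched roughly as follows. The grouping of SPDX identifiers is a hand-picked assumption for illustration, not an official ranking, and the function names `license_bin` and `log_contributors` are hypothetical.

```python
import math

# Assumed, hand-picked grouping of SPDX identifiers (not an official ranking).
PERMISSIVENESS_BINS = {
    "permissive": {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"},
    "weak_copyleft": {"MPL-2.0", "LGPL-2.1", "LGPL-3.0", "EPL-2.0"},
    "strong_copyleft": {"GPL-2.0", "GPL-3.0", "AGPL-3.0"},
}


def license_bin(spdx_id):
    """Map an SPDX identifier (or None) to a coarse permissiveness bin."""
    if not spdx_id:
        return "none"  # no license detected at all
    base = spdx_id
    # Newer SPDX identifiers carry "-only" / "-or-later"; strip them for lookup.
    for suffix in ("-only", "-or-later"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
    for bin_name, identifiers in PERMISSIVENESS_BINS.items():
        if base in identifiers:
            return bin_name
    return "other"  # unrecognized or custom license


def log_contributors(count):
    """Log-scale the contributor count; log1p keeps a count of 0 well defined."""
    return math.log1p(count)
```

Treating the bins as ordered categories (none, strong copyleft, weak copyleft, permissive) would be one way to test the "more-permissive to less-permissive" question, though where "other" belongs in that order is an open choice.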
README length | Number of topics | Seconds since first push to GitHub | Seconds since last update | Number of languages used | Most common language used | LICENSE type | CODE_OF_CONDUCT | CONTRIBUTING | FUNDING | Number of issue templates | Number of pull request templates | SECURITY | SUPPORT | CODEOWNERS | CHANGELOG or Releases | Milestones | CodeSpaces | Wiki | Discussions | Number of GitHub Actions workflows | Dependabot | GitHub Checks | Average seconds until first issue response | Average seconds until first pull request response | Ratio of pull requests merged to pull requests closed without merging | Number of unique contributors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30139 | 12 | 36010 | 3600 | 5 | JavaScript | Apache-2.0 | True | True | False | 2 | 1 | True | True | True | True | False | False | False | True | 21 | False | True | 3600 | 3600 | 2.5 | 312 |
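
As a rough illustration of how a few columns of such a row might be derived, the sketch below takes a repository object shaped like the one returned by GitHub's REST API and computes some of the basic features. It assumes `created_at` is an acceptable stand-in for "first push", which may not hold for every repository; the function names and output keys are hypothetical.

```python
from datetime import datetime, timezone


def parse_github_time(value):
    """Parse GitHub's UTC timestamps, e.g. '2011-01-26T19:01:12Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def basic_features(repo, now):
    """Build part of a feature row from a repository JSON object.

    `repo` is a dict shaped like the REST API's repository response;
    `now` is a timezone-aware datetime used as the reference point.
    """
    created = parse_github_time(repo["created_at"])  # assumed proxy for first push
    updated = parse_github_time(repo["updated_at"])
    license_info = repo.get("license") or {}  # the API reports null when unlicensed
    return {
        "num_topics": len(repo.get("topics", [])),
        "seconds_since_created": int((now - created).total_seconds()),
        "seconds_since_last_update": int((now - updated).total_seconds()),
        "primary_language": repo.get("language"),
        "license_type": license_info.get("spdx_id"),
        "wiki": bool(repo.get("has_wiki")),
        "discussions": bool(repo.get("has_discussions")),
    }
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps every row in one collection run measured against the same instant.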
As linked above, data about these features are available by calling public URLs or by calling corresponding publicly-available endpoints on GitHub's API, but the data will likely require some preprocessing. For example, there's an endpoint to get a repository's README, but not one for its length: we'd need to calculate and possibly normalize that.
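
For the README case, a minimal sketch of that preprocessing might look like this. It assumes the payload is the JSON body of GitHub's README endpoint, which returns the file as base64 text in a `content` field alongside an `encoding` field; `readme_length` is a hypothetical helper name, and counting characters rather than bytes or words is just one possible definition of "length".

```python
import base64


def readme_length(readme_payload):
    """Return the README length in characters from a README API response body."""
    if readme_payload.get("encoding") != "base64":
        raise ValueError("expected a base64-encoded README payload")
    # b64decode ignores the line breaks GitHub inserts into the encoded text.
    raw = base64.b64decode(readme_payload["content"])
    # Replace undecodable bytes instead of failing on an oddly encoded README.
    return len(raw.decode("utf-8", errors="replace"))
```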
There are some caveats. We might be able to use GitHub's newer GraphQL API to simplify some of this data collection, but many of the endpoints above aren't available through the GraphQL interface yet. Also, because we'd be making several API calls for each repository, we're likely to run into rate limits.
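
One way to cope with rate limits is to read the rate-limit headers GitHub attaches to API responses and pause until the window resets. The sketch below assumes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers (the latter as a Unix timestamp in seconds); `seconds_until_reset` is a hypothetical helper, and the actual sleeping and retrying is left to the caller.

```python
def seconds_until_reset(headers, now_epoch):
    """Return how long to wait before the next API call, in seconds.

    `headers` is a mapping of response headers; `now_epoch` is the current
    Unix time. Returns 0 while requests remain in the current window.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0
    reset_epoch = int(lowered.get("x-ratelimit-reset", now_epoch))
    # Never return a negative wait if the reset time has already passed.
    return max(0, reset_epoch - int(now_epoch))
```

A collection loop could call this after each response and `time.sleep()` for the returned duration before moving on to the next repository.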
To get a list of repositories for our data set, we'd likely call GitHub's repository API in a way similar to this. I like the example's approach of only grabbing repositories with at least 10 stargazers, since that filters out many abandoned or trivial repositories.
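
A request along those lines could be built roughly as shown here. This is only a sketch: it assumes GitHub's repository search endpoint with the `stars:>=N` qualifier, and the function name and default values are illustrative rather than taken from the linked example.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def repo_search_url(min_stars=10, page=1, per_page=100):
    """Build a search URL for repositories with at least `min_stars` stars."""
    params = {
        "q": f"stars:>={min_stars}",  # search qualifier for the star threshold
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"
```

Because the search API only returns a limited number of results per query, a full data set would probably need several queries over narrower star ranges instead of one long pagination run.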