GitHub Community Health

Gary Crye - Interest: 8/10

Context

GitHub promotes several community health files that supposedly help grow communities. GitHub has built tight integrations around certain files so that repository owners can customize and enhance the experience of users and potential contributors. An easy example of this is how the README.md serves as the repository's landing page. A more complicated example is how a GitHub Actions workflow (.github/workflows/*.yml) can run automated testing, increasing confidence in new code contributions. We'd like to measure what (if any) impact these and other features have on unique contributor counts.

This information could help us decide which features to prioritize in future projects when we want outside contributions from the open source community.

Data Description

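| Feature | Role | Example |
| --- | --- | --- |
| README length | Possible input | 30139 |
| Number of topics | Possible input | 12 |
| Seconds since first push to GitHub | Possible input | 36010 |
| Seconds since last update | Possible input | 3600 |
| Number of languages used | Possible input | 5 |
| Most common language used | Possible input | JavaScript |
| LICENSE type | Possible input | Apache-2.0 |
| CODE_OF_CONDUCT | Possible input | True |
| CONTRIBUTING | Possible input | True |
| FUNDING | Possible input | False |
| Number of issue templates | Possible input | 2 |
| Number of pull request templates | Possible input | 1 |
| SECURITY | Possible input | True |
| SUPPORT | Possible input | True |
| CODEOWNERS | Possible input | True |
| CHANGELOG or Releases | Possible input | True |
| Milestones | Possible input | False |
| CodeSpaces | Possible input | False |
| Wiki | Possible input | False |
| Discussions | Possible input | True |
| Number of GitHub Actions workflows | Possible input | 21 |
| Dependabot | Possible input | False |
| GitHub Checks | Possible input | True |
| Average seconds until first issue response | Possible input | 3600 |
| Average seconds until first pull request response | Possible input | 3600 |
| Ratio of pull requests merged to pull requests closed without merging | Possible input | 2.5 |
| Number of unique contributors | To try to predict | 312 |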
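
A row like the example mixes numbers, booleans, and categories, so it would need encoding before most models can use it. Here is a minimal sketch of one way to do that. The key names (`readme_length`, `unique_contributors`, and so on) and the helper `encode_row` are hypothetical, and one-hot encoding the categorical columns is an assumption rather than a settled choice.

```python
def encode_row(row, categorical=("most_common_language", "license_type")):
    """Split one repository row into numeric input features and the target.

    All key names here are hypothetical placeholders for the table's columns.
    """
    features = {}
    for name, value in row.items():
        if name == "unique_contributors":
            continue  # this is the value we are trying to predict
        if isinstance(value, bool):
            # Check bool before numbers: in Python, bool is a subclass of int.
            features[name] = int(value)
        elif name in categorical:
            # One-hot style key, e.g. "license_type=Apache-2.0".
            features[f"{name}={value}"] = 1
        else:
            features[name] = float(value)
    return features, row["unique_contributors"]


# A few columns from the example row, using the hypothetical key names.
example = {
    "readme_length": 30139,
    "most_common_language": "JavaScript",
    "license_type": "Apache-2.0",
    "code_of_conduct": True,
    "funding": False,
    "pr_merge_ratio": 2.5,
    "unique_contributors": 312,
}
features, target = encode_row(example)
```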

Gathering the Data

As linked above, data about these features are available by calling public URLs or by calling corresponding publicly-available endpoints on GitHub's API, but the data will likely require some preprocessing. For example, there's an endpoint to get a repository's README, but not one for its length: we'd need to calculate and possibly normalize that.
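
As a rough sketch of the README example: the REST endpoint `GET /repos/{owner}/{repo}/readme` returns the file as base64-encoded `content`, so the length has to be computed after decoding. The function names below are hypothetical, and using `log1p` to normalize is only one assumed option.

```python
import base64
import json
import math
import urllib.request


def fetch_readme_payload(owner, repo, token=None):
    """Fetch the README JSON for one repository (makes a network call)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def readme_length(payload):
    """Return the README length in characters from an API payload."""
    if payload.get("encoding") != "base64":
        raise ValueError("expected a base64-encoded README payload")
    raw = base64.b64decode(payload["content"])
    # Replace undecodable bytes instead of failing on unusual encodings.
    return len(raw.decode("utf-8", errors="replace"))


def normalized_readme_length(payload):
    """Compress the long tail of README sizes (assumed normalization)."""
    return math.log1p(readme_length(payload))
```

Keeping the network call separate from the length calculation means the calculation can be checked against a hand-built payload without touching the API.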

There are some caveats. We might be able to use GitHub's newer GraphQL API to simplify some of this data collection, but many of the endpoints above aren't available through the GraphQL interface yet. Also, because we'd make several API calls for each repository, we're likely to run into rate limits.
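
One way we might cope with rate limits is to read the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that the REST API returns with each response and pause until the reset time once the quota is used up. A minimal sketch, assuming the headers arrive as a plain dict with exactly those key names (the helper `seconds_to_wait` is hypothetical):

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how long to sleep before the next API call.

    `headers` is assumed to be a dict of response headers. The reset
    header holds a Unix timestamp in seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - current)
```

A collection loop could call `time.sleep(seconds_to_wait(response_headers))` after each request.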

To get a list of repositories for our data set, we'd likely call GitHub's repository API in a way similar to this. I like the example's approach of only grabbing repositories with at least 10 stargazers, since that filters out many repositories with too little activity to be informative.
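
For illustration, here is a sketch of building such a request against GitHub's repository search endpoint with a `stars:>=10` qualifier. Using the search endpoint, sorting by stars, and the helper name `build_search_url` are our assumptions, not necessarily what the linked example does.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(min_stars=10, page=1, per_page=100):
    """Build a search URL for repositories with at least `min_stars` stars."""
    params = {
        "q": f"stars:>={min_stars}",  # search qualifier for the star threshold
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # the API caps this at 100
        "page": page,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


first_page_url = build_search_url()
```

The search API only returns the first 1,000 results for any one query, so a larger data set would probably need several queries, for example split by star range or creation date.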