Tools for the GHPR dataset.
GHPR contains data about pull requests that have fixed one or more issues on GitHub. Each instance of GHPR contains data about an issue and a pull request, where the pull request has fixed the issue.
Install the requirements from requirements.txt:
pip3 install -r requirements.txt
GHPR Crawler uses the GitHub REST API to find pull requests that have fixed one or more issues on GitHub. It saves such issues and pull requests as JSON files.
The raw data for GHPR is an example of data generated by the GHPR Crawler.
The flow is as follows:
- For repository R:
  - For each page G in the list of closed pull requests in R, from oldest to newest:
    - Optionally save G.
    - For each simple pull request p in G:
      - If p is merged:
        - Let L be the list of issue numbers that are linked by p using a GitHub keyword and are in R.
        - If L is not empty:
          - Fetch pull request P with the pull request number of p.
          - Set the linked_issue_numbers property of P to L.
          - Save P.
          - For each issue number i in L:
            - Fetch issue I with the issue number i.
            - Save I.
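To make the flow concrete, below is a minimal sketch of the same loop using the GitHub REST API through the requests library. It is not the actual crawler: crawl_sketch is a hypothetical name, the closing-keyword regex is simplified, and there is no retry, logging, or rate-limit handling (see crawler.py for the real implementation).

import json
import os
import re

import requests

API = 'https://api.github.com'
# Simplified pattern for GitHub closing keywords followed by a same-repository "#N" reference.
KEYWORD_RE = re.compile(
    r'\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)', re.IGNORECASE)


def crawl_sketch(owner, repo, token=None, dst_dir='repos', per_page=100):
    """Saves merged pull requests and the issues they fixed as JSON files."""
    headers = {'Accept': 'application/vnd.github.v3+json'}
    if token:
        headers['Authorization'] = 'token ' + token
    out_dir = os.path.join(dst_dir, owner, repo)
    os.makedirs(out_dir, exist_ok=True)
    page = 1
    while True:
        # Page G: closed pull requests, oldest first.
        resp = requests.get(
            f'{API}/repos/{owner}/{repo}/pulls',
            headers=headers,
            params={'state': 'closed', 'sort': 'created', 'direction': 'asc',
                    'per_page': per_page, 'page': page})
        resp.raise_for_status()
        simple_pulls = resp.json()
        if not simple_pulls:
            break
        for p in simple_pulls:
            if not p.get('merged_at'):
                continue  # p is not merged.
            # L: issue numbers linked by a keyword in the pull request description.
            linked = sorted({int(n) for n in KEYWORD_RE.findall(p.get('body') or '')})
            if not linked:
                continue
            # Fetch the full pull request P and attach the linked issue numbers.
            pull = requests.get(
                f'{API}/repos/{owner}/{repo}/pulls/{p["number"]}', headers=headers).json()
            pull['linked_issue_numbers'] = linked
            with open(os.path.join(out_dir, f'pull-{p["number"]}.json'), 'w') as f:
                json.dump(pull, f)
            for i in linked:
                issue = requests.get(
                    f'{API}/repos/{owner}/{repo}/issues/{i}', headers=headers).json()
                with open(os.path.join(out_dir, f'issue-{i}.json'), 'w') as f:
                    json.dump(issue, f)
        page += 1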
Run python3 crawler.py --help for usage.
$ python3 crawler.py --help
usage: crawler.py [-h] [-t TOKEN] [-d DST_DIR] [-s START_PAGE] [-p PER_PAGE]
[-a] [-m MAX_REQUEST_TRIES] [-r REQUEST_RETRY_WAIT_SECS]
[-l LOG_FILE]
repo [repo ...]
Crawl GitHub repositories to find and save issues and pull requests that have
fixed them. The crawler goes through the pages of closed pull requests, from
oldest to newest. If a pull request is merged and links one or more issues in
its description, the pull request and its linked issue(s) will be fetched and
saved as JSON files. The list of linked issue numbers is added to the fetched
pull request JSON object with the key "linked_issue_numbers". The JSON files
will be saved in DST_DIR/owner/repo. The directories will be created if they
do not already exist. The naming pattern for files is issue-N.json for issues,
pull-N.json for pull requests, and pulls-page-N.json for pages of pull
requests. Any existing file will be overwritten. The GitHub API limits
unauthenticated clients to 60 requests per hour. The rate limit is 5,000
requests per hour for authenticated clients. For this reason, you should
provide a GitHub OAuth token if you want to crawl a large repository. You can
create a personal access token at https://github.com/settings/tokens.
positional arguments:
repo full repository name, e.g., "octocat/Hello-World" for
the https://github.com/octocat/Hello-World repository
optional arguments:
-h, --help show this help message and exit
-t TOKEN, --token TOKEN
your GitHub OAuth token, can also be provided via a
GITHUB_OAUTH_TOKEN environment variable (default:
None)
-d DST_DIR, --dst-dir DST_DIR
directory for saving JSON files (default: repos)
-s START_PAGE, --start-page START_PAGE
page to start crawling from (default: 1)
-p PER_PAGE, --per-page PER_PAGE
pull requests per page, between 1 and 100 (default:
100)
-a, --save-pull-pages
save the pages of pull requests (default: False)
-m MAX_REQUEST_TRIES, --max-request-tries MAX_REQUEST_TRIES
number of times to try a request before terminating
(default: 100)
-r REQUEST_RETRY_WAIT_SECS, --request-retry-wait-secs REQUEST_RETRY_WAIT_SECS
seconds to wait before retrying a failed request
(default: 10)
-l LOG_FILE, --log-file LOG_FILE
file to write logs to (default: None)
See crawler.py.
class Crawler(object):
"""Crawl GitHub repositories to find and save merged pull requests and the issues
they have fixed.
The crawler goes through the pages of closed pull requests, from oldest to
newest. If a pull request is merged and links one or more issues in its
description, the pull request and its linked issue(s) will be fetched and
saved as JSON files. The list of linked issue numbers is added to the fetched
pull request JSON object with the key "linked_issue_numbers". The JSON files
will be saved in dst_dir/owner/repo. The directories will be created if they
do not already exist. The naming pattern for files is issue-N.json for issues,
pull-N.json for pull requests, and pulls-page-N.json for pages of pull
requests. Any existing file will be overwritten. The GitHub API limits
unauthenticated clients to 60 requests per hour. The rate limit is 5,000
requests per hour for authenticated clients. For this reason, you should
provide a GitHub OAuth token if you want to crawl a large repository. You can
create a personal access token at https://github.com/settings/tokens.
Attributes:
dst_dir (str): Directory for saving JSON files.
per_page (int): Pull requests per page, between 1 and 100.
save_pull_pages (bool): Save the pages of pull requests.
max_request_tries (int): Number of times to try a request before
terminating.
request_retry_wait_secs (int): Seconds to wait before retrying a failed request.
"""
def __init__(self,
token=None,
dst_dir='repos',
per_page=100,
save_pull_pages=False,
max_request_tries=100,
request_retry_wait_secs=10):
"""Initializes Crawler.
The GitHub API limits unauthenticated clients to 60 requests per hour. The
rate limit is 5,000 requests per hour for authenticated clients. For this
reason, you should provide a GitHub OAuth token if you want to crawl a large
repository. You can create a personal access token at
https://github.com/settings/tokens.
Args:
token (str): Your GitHub OAuth token. If None, the crawler will be
unauthenticated.
dst_dir (str): Directory for saving JSON files.
per_page (int): Pull requests per page, between 1 and 100.
save_pull_pages (bool): Save the pages of pull requests.
max_request_tries (int): Number of times to try a request before
terminating.
request_retry_wait_secs (int): Seconds to wait before retrying a failed request.
"""
def crawl(self, owner, repo, start_page=1):
"""Crawls a GitHub repository, finds and saves merged pull requests and the issues
they have fixed.
The crawler goes through the pages of closed pull requests, from oldest to
newest. If a pull request is merged and links one or more issues in its
description, the pull request and its linked issue(s) will be fetched and
saved as JSON files. The list of linked issue numbers is added to the fetched
pull request JSON object with the key "linked_issue_numbers". The JSON files
will be saved in dst_dir/owner/repo. The directories will be created if they
do not already exist. The naming pattern for files is issue-N.json for issues,
pull-N.json for pull requests, and pulls-page-N.json for pages of pull
requests. Any existing file will be overwritten.
Args:
owner (str): The username of the repository owner, e.g., "octocat" for the
https://github.com/octocat/Hello-World repository.
repo (str): The name of the repository, e.g., "Hello-World" for the
https://github.com/octocat/Hello-World repository.
start_page (int): Page to start crawling from.
Raises:
TooManyRequestFailures: A request failed max_request_tries times.
"""
GHPR Writer reads JSON files downloaded by the GHPR Crawler and writes a CSV file from their data.
The GHPR dataset is an example of data generated by the GHPR Writer.
Run python3 writer.py --help for usage.
$ python3 writer.py --help
usage: writer.py [-h] [-l LIMIT_ROWS] src_dir dst_file
Read JSON files downloaded by the Crawler and write a CSV file from their
data. The source directory must contain owner/repo/issue-N.json and
owner/repo/pull-N.json files. The destination directory of Crawler should
normally be used as the source directory of Writer. The destination file will
be overwritten if it already exists.
positional arguments:
src_dir source directory
dst_file destination CSV file
optional arguments:
-h, --help show this help message and exit
-l LIMIT_ROWS, --limit-rows LIMIT_ROWS
limit number of rows to write, ignored if non-positive
(default: 0)
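For example, to build a CSV from the crawler output in the repos directory (the output file name is a placeholder):

$ python3 writer.py repos ghpr.csv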
See writer.py.
def write_dataset(src_dir, dst_file, limit_rows=0):
"""Reads JSON files downloaded by the Crawler and writes a CSV file from their
data.
The CSV file will have the following columns:
- repo_id: Integer
- issue_number: Integer
- issue_title: Text
- issue_body_md: Text, in Markdown format, can be empty
- issue_body_plain: Text, in plain text, can be empty
- issue_created_at: Integer, in Unix time
- issue_author_id: Integer
- issue_author_association: Integer enum (see values below)
- issue_label_ids: Comma-separated integers, can be empty
- pull_number: Integer
- pull_created_at: Integer, in Unix time
- pull_merged_at: Integer, in Unix time
- pull_comments: Integer
- pull_review_comments: Integer
- pull_commits: Integer
- pull_additions: Integer
- pull_deletions: Integer
- pull_changed_files: Integer
The value of issue_body_plain is converted from issue_body_md. The conversion is
not always perfect. In some cases, issue_body_plain still contains some Markdown
tags.
The value of issue_author_association can be one of the following:
- 0: Collaborator
- 1: Contributor
- 2: First-timer
- 3: First-time contributor
- 4: Mannequin
- 5: Member
- 6: None
- 7: Owner
Rows are sorted by repository owner username, repository name, pull request
number, and then issue number.
The source directory must contain owner/repo/issue-N.json and
owner/repo/pull-N.json files. The destination directory of Crawler should
normally be used as the source directory of Writer. The destination file will be
overwritten if it already exists.
Args:
src_dir (str): Source directory.
dst_file (str): Destination CSV file.
limit_rows (int): Maximum number of rows to write; ignored if non-positive.
"""