Introduction
The GHPR dataset identifies defect information based on pull requests (PRs) in the GitHub workflow. Previous studies identified defective source code through git commits or issues on GitHub, but many commit (or issue) descriptions are not well-formed, which adds noise to defect datasets. We believe that defect information that has gone through team review in the GitHub workflow is more accurate.
Data
Data Name | Type | Description |
Project Name | Project Information | The name of the project |
Project Owner | Project Information | The owner of the project |
Language | Project Information | The programming language of the project |
Git_Address | Project Information | The git address of the project |
Project Description | Project Information | The description of the project provided by the owner |
Project Label | Project Information | The label of the project provided by the owner |
PR Title | PR information from remote repository | The title of the defect-related PR |
PR Description | PR information from remote repository | The description of the defect-related PR |
SHA New | PR information from remote repository | The ID of the version after the defect was fixed |
SHA Old | PR information from remote repository | The ID of the version before the defect was fixed |
Path | PR information from remote repository | The path of the files changed between the versions before and after the fix |
Content_New | Defect-related code | The content of the related files after the defect was fixed |
Content_Old | Defect-related code | The content of the related files before the defect was fixed |
Defect Code | Defect-related code | The defect code we identify from the content of the related files |
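
As a rough illustration of how these fields might be consumed, the sketch below reads rows from a CSV export into a small record type. It assumes the CSV header row uses the names in the Data Name column above, which this README does not confirm, so check the header of the actual file first. `DefectRecord` and `read_defect_records` are hypothetical names, and the sample row is synthetic, not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DefectRecord:
    """A subset of the fields from the table above (hypothetical type)."""
    project_name: str
    sha_old: str
    sha_new: str
    path: str
    content_old: str
    content_new: str


def read_defect_records(handle):
    """Yield one DefectRecord per CSV row.

    ASSUMPTION: the header row uses the "Data Name" entries from the
    table as column names. Adjust the keys if the real file differs.
    """
    for row in csv.DictReader(handle):
        yield DefectRecord(
            project_name=row["Project Name"],
            sha_old=row["SHA Old"],
            sha_new=row["SHA New"],
            path=row["Path"],
            content_old=row["Content_Old"],
            content_new=row["Content_New"],
        )


# Synthetic one-row sample standing in for a real export.
sample = io.StringIO(
    "Project Name,SHA Old,SHA New,Path,Content_Old,Content_New\n"
    "demo,aaa111,bbb222,src/Foo.java,int x = 1;,int x = 2;\n"
)
records = list(read_defect_records(sample))
print(records[0].path)
```

With a real file, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` sample. Because Content_New and Content_Old hold whole file contents, some rows may exceed the `csv` module's default field size limit; `csv.field_size_limit` can raise it if needed.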
Requirements
Quick Start
.csv and .sql
References
Please cite our paper if you use this dataset in your own work: