GHPR_dataset

GitHub defect dataset for software defect prediction

Introduction

The GHPR dataset identifies defect information based on pull requests (PRs) in the GitHub workflow. Previous studies identified defective source code by git commits or GitHub issues, but many commit (or issue) descriptions are not well-formed, which adds noise to defect datasets. We believe that defect information backed by team review in the GitHub workflow is more accurate.

Data

| Data Name | Type | Description |
| --- | --- | --- |
| Project Name | Project information | The name of the project |
| Project Owner | Project information | The owner of the project |
| Language | Project information | The programming language of the project |
| Git_Address | Project information | The git address of the project |
| Project Description | Project information | The description of the project provided by the owner |
| Project Label | Project information | The label of the project provided by the owner |
| PR Title | PR information from remote repository | The title of the defect-related PR |
| PR Description | PR information from remote repository | The description of the defect-related PR |
| SHA New | PR information from remote repository | The ID of the version after the defect was fixed |
| SHA Old | PR information from remote repository | The ID of the version before the defect was fixed |
| Path | PR information from remote repository | The path of the files changed between the versions before and after the fix |
| Content_New | Defect-related code | The content of the related files after the defect was fixed |
| Content_Old | Defect-related code | The content of the related files before the defect was fixed |
| Defect Code | Defect-related code | The defective code identified from the content of the related files |
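The Content_Old and Content_New fields hold the same file before and after the fix. As a minimal sketch of how the two can be compared, the Python snippet below lists the lines of the pre-fix file that the fix changed or deleted. The function name `changed_old_lines` is hypothetical, and this is only an illustration; it is not necessarily how the Defect Code field was produced.

```python
import difflib


def changed_old_lines(content_old, content_new):
    """Return lines of the pre-fix file that the fix replaced or deleted."""
    old = content_old.splitlines()
    new = content_new.splitlines()
    # autojunk=False avoids the popularity heuristic on long source files.
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" both mean old[i1:i2] did not survive the fix.
        if tag in ("replace", "delete"):
            changed.extend(old[i1:i2])
    return changed


# Toy example (made-up content, not taken from the dataset).
example_old = "int f(int a, int b) {\n    return a - b;\n}\n"
example_new = "int f(int a, int b) {\n    return a + b;\n}\n"
example_changed = changed_old_lines(example_old, example_new)
```

Lines that the fix only added have no counterpart in Content_Old, so they do not appear in the result.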

Requirements

Quick Start

.csv and .sql
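A minimal sketch for reading the .csv file with the Python standard library is shown here. It assumes the file is a flat export with a header row whose column names match the Data Name entries in the table above (for example `Language`); check the header of the actual file first. The function names and the path `ghpr.csv` are hypothetical.

```python
import csv
from collections import Counter


def load_records(csv_path):
    """Read every row of the CSV into a list of dicts keyed by column name."""
    # Content fields hold whole source files, which can exceed the default
    # per-field size limit of the csv module, so the limit is raised here.
    csv.field_size_limit(2**31 - 1)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_by_language(records, column="Language"):
    """Count records per value of the (assumed) language column."""
    return Counter(row[column] for row in records)


# Usage, with a hypothetical file name:
# records = load_records("ghpr.csv")
# print(count_by_language(records).most_common(5))
```

Opening the file with `newline=""` lets the csv module handle line breaks embedded in quoted fields, which matters for columns that contain source code.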

References

Please cite our paper if you use this dataset in your own work: