Replication Package for ICSE 2022 Paper "Recommending Good First Issues in GitHub OSS Projects"

This is the replication package for the ICSE 2022 paper Recommending Good First Issues in GitHub OSS Projects. It contains: 1) a dataset of 53,510 resolved issues (file issuedata.zip in Zenodo); and 2) scripts to train different models and reproduce evaluation results, as described in the paper.

The package is stored in the git repository https://github.com/mcxwx123/RecGFI and permanently archived at Zenodo. To reproduce the results in the paper, either configure an Anaconda environment to run the scripts or use the VirtualBox VM image we provide at Zenodo.

Update 2022.07.15: We have released an up-to-date GFI recommendation dataset (with issues until 2022.07) from our GFI-Bot project! This dataset is better structured, has clearer fields, and is easier to reuse. Please consider also using this dataset in your research, and we would greatly appreciate it if you cite our two papers:

@inproceedings{xiao2022recommending,
  title={Recommending good first issues in GitHub OSS projects},
  author={Xiao, Wenxin and He, Hao and Xu, Weiwei and Tan, Xin and Dong, Jinhao and Zhou, Minghui},
  booktitle={2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)},
  pages={1830--1842},
  year={2022},
  organization={IEEE}
}
@article{gfi-bot,
  title={GFI-Bot: Automated Good First Issue Recommendation on GitHub},
  author={He, Hao and Su, Haonan and Xiao, Wenxin and He, Runzhi and Zhou, Minghui}
}

Introduction

In the ICSE 2022 paper, we propose RecGFI, an effective and practical approach for the automated recommendation of Good First Issues (GFIs) to OSS newcomers, which can relieve maintainers' burden and help newcomers onboard.

For this purpose, we locate 100 newcomer-friendly GitHub projects, use GHTorrent and the GitHub REST API to restore the historical states of all their issues, and find issues resolved by newcomers. With this dataset, we check the performance of RecGFI under a variety of settings. Additionally, we collect the latest open issues from the 100 projects and predict whether they are GFIs. We report potential GFIs to project maintainers and record their responses and the state of these issues after several months.
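For illustration, the following is a minimal sketch of the kind of GitHub REST API query involved in identifying newcomer-resolved issues. The endpoint and parameters are standard GitHub REST API, but the function name, the token placeholder, and the newcomer criterion shown here are illustrative rather than the paper's exact definition.

import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": "token <your-github-token>"}  # placeholder personal access token

def resolver_is_newcomer(owner, repo, login, resolved_at):
    # True if `login` authored no commits in owner/repo before the issue was resolved
    # (an illustrative criterion; see the paper for the exact definition used).
    resp = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/commits",
        headers=HEADERS,
        params={"author": login, "until": resolved_at, "per_page": 1},
    )
    resp.raise_for_status()
    return len(resp.json()) == 0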

All automated processing is implemented in Python in an Anaconda environment, and the detailed results can be found in our paper. We hope the dataset and scripts in this replication package can be leveraged to facilitate further studies on recommending issues to newcomers and other related fields. We intend to claim the Artifacts Available badge and the Artifacts Evaluated - Reusable badge for our replication package. To meet the requirements of these badges, we provide persistent ways to download our artifact from Zenodo and the GitHub repository. We also provide a VM image to ensure reproducibility, and we introduce the background, the required environment, and the steps needed to reproduce the results of our paper. Therefore, we believe our artifact is available and reusable.

File Structure

.
├── data                                         # Data files
│   ├── Statistics.png                           # Statistics of the dataset
│   ├── data_preprocess_1.py                     # Script to generate data for issues at their 1st time point
│   └── data_preprocess_2.py                     # Script to generate data for issues at their 2nd time point
├── intepretation                                # Files for RQ2
│   ├── data                                     # Data generated by running Run_lime.py
│       ├── allproft.mat                         # Data for Figure 6 in the paper
│       ├── ftvalue.mat                          # Data for Figure 4 in the paper
│       ├── oneproft.mat                         # Data for Figure 3 in the paper
│       ├── predata.mat                          # Data for Figure 2 in the paper
│       ├── rptpre.mat                           # Data for Figure 5 in the paper
│       └── test.csv                             # The processed/interpreted issue data for generating the other *.mat data files
│   ├── draw_figs                                # Files to draw figures
│       ├── bars.m                               # Script to generate Figure 6 in the paper
│       ├── difrpt.m                             # Script to generate Figure 4 in the paper
│       ├── featurebox.eps                       # Figure 3
│       ├── featuresbox.m                        # Script to generate Figure 3 in the paper
│       ├── ftvalue.eps                          # Figure 4
│       ├── matprehist.m                         # Script to generate Figure 2 in the paper
│       ├── preres.eps                           # Figure 2
│       ├── probar.eps                           # Figure 6
│       ├── rptpre.eps                           # Figure 5
│       └── rptpre.m                             # Script to generate Figure 5 in the paper
│   ├── Run_lime.py                              # Script for running LIME, which saves results in data/test.csv
│   └── data_utils_lime.py                       # Utils for Run_lime.py
├── models                                       # Model files 
│   ├── utils                                    # Data processing utilities
│       ├── __init__.py
│       ├── data_utils.py                        # Utils to load full data set
│       ├── data_utils_ablate_comment.py         # Utils to load data without comment related features
│       ├── data_utils_ablate_event.py           # Utils to load data without issue-event related features
│       ├── data_utils_ablate_experience.py      # Utils to load data without developer experience related features
│       ├── data_utils_ablate_issue.py           # Utils to load data without issue title and description related features
│       ├── data_utils_ablate_label.py           # Utils to load data without issue label related features
│       ├── data_utils_ablate_project.py         # Utils to load data without project background related features
│       ├── data_utils_ablate_rpt.py             # Utils to load data without reporter related features
│       ├── data_utils_baseline_comment.py       # Utils to load data with only comment related features
│       ├── data_utils_baseline_event.py         # Utils to load data with only issue-event related features
│       ├── data_utils_baseline_experience.py    # Utils to load data with only developer experience related features
│       ├── data_utils_baseline_issue.py         # Utils to load data with only issue title and description related features
│       ├── data_utils_baseline_label.py         # Utils to load data with only issue label related features
│       ├── data_utils_baseline_project.py       # Utils to load data with only project background related features
│       ├── data_utils_baseline_rpt.py           # Utils to load data with only reporter related features
│       ├── metrics_util.py                      # Utils to calculate metrics for evaluating models
│       └── vectorize.py                         # Script to process text of issue title and comments
│   ├── RecGFI.py                                # Script to run baseline models
│   └── wordcloud.py                             # Script to draw a word cloud for issue descriptions
├── real_world_evaluation                        # Files for RQ3
│   └── prediction_real_world_issues.csv         # Issue responses for real world evaluation
├── LICENSE                                      # Licence for this replication package
├── Main.py                                      # Main entry script that preprocesses data, runs RecGFI and draws two word clouds
├── PAPER.pdf                                    # Our paper
├── README.md                                    # Instructions for using this replication package
├── requirements-lock.txt                        # Python dependency specifications
├── wordcloud0.png                               # Figure 7 (left) generated by models/wordcloud.py (color and layout of words may change each time)
└── wordcloud1.png                               # Figure 7 (right) generated by models/wordcloud.py

Update 2022.04.24

We added two additional files for locating the issues in data/issuedata.json on GitHub:

  1. data/repo_id_info.csv records the owner name and repository name for each repository, keyed by repo_id (its ID in GHTorrent);
  2. data/issue_id_info.csv records the owner name, repository name, issue number, and the number of commits from the issue closer in the repository for each issue, keyed by issue_id (its ID in GHTorrent).

Thus, additional issue data can be retrieved from GitHub for future research. For example, data/issue_id_info.csv can serve as a ground truth dataset for evaluating new GFI recommendation approaches that collect data and build features entirely from GitHub.
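As a minimal sketch (not part of the original scripts), the GitHub URLs of the issues can be reconstructed from data/issue_id_info.csv roughly as follows; the column names are assumptions based on the description above and should be adjusted to the actual CSV header.

import pandas as pd

issues = pd.read_csv("data/issue_id_info.csv")
for _, row in issues.head(5).iterrows():
    # column names (owner_name, repo_name, issue_number, issue_id) are assumed
    url = f"https://github.com/{row['owner_name']}/{row['repo_name']}/issues/{row['issue_number']}"
    print(row["issue_id"], url)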

Required Skills and Environment

For unobstructed usage of this replication package, we expect the user to have a reasonable amount of knowledge of git, Linux, Python, and Anaconda, as well as some experience with Python data science development.

We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8GB of memory, and 100GB of free storage. We conducted development and executed all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320GB of memory, and 36TB of RAID 5 storage. We have also vetted this replication package on an Ubuntu 20.04 VirtualBox VM with 1 CPU core, 8GB of memory, and 100GB of storage, and on a Windows 10 machine with 8 CPU cores and 8GB of memory.

Replication Package Setup

In this section, we introduce how to set up the environment required to reproduce the results in the paper. First, clone this repository or download the repository archive from Zenodo.

Switch to the RecGFI folder. We use Anaconda for Python development. Configure a new Conda environment by executing the following commands:

conda create -n RecGFI python=3.8
conda activate RecGFI
python -m pip install -r requirements-lock.txt

If you downloaded the repository archive from Zenodo, the issue dataset is already included at RecGFI/data/issuedata.json. However, this file is too large (1.6GB) for git, so if you cloned from GitHub, please download it separately from Zenodo and place it at that path.
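As a quick sanity check (a convenience snippet, not part of the original scripts), you can verify that the dataset is in place and readable; this assumes the file is a single JSON document.

import json
import os

path = "data/issuedata.json"
assert os.path.exists(path), f"{path} not found; download it from Zenodo and place it here"
with open(path) as f:
    issues = json.load(f)  # if the file is JSON Lines instead, read it line by line
print(f"Loaded {len(issues)} records from {path}")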

Using the VirtualBox VM Image

To ease the burden of building the required environment, we supply a VirtualBox VM image for replicating the experimental results quickly and easily. You can download the VM image from Zenodo, then register and open it with VirtualBox. The password is icse22ae. A folder named RecGFI on the Desktop contains everything already configured. In a terminal, remember to run conda activate RecGFI to activate the corresponding Conda environment before executing the scripts below.

Replicating Results

Switch your working directory to RecGFI. Run Main.py to get the performance results for RQ1 in our paper.

python Main.py

During this process, some data preprocessing will take place, and the preprocessed data is left in the RecGFI/data folder. You can comment out the calls to data_preprocess1() and data_preprocess2() in Main.py if the preprocessed files already exist. The whole script can consume up to 6GB of memory and take about five to twelve hours to finish. It may generate some warning messages about logistic regression, but this is expected.
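For example, a hypothetical guard like the following could be placed around the two calls in Main.py so that repeated runs skip the expensive preprocessing; the import paths and output file names here are assumptions, so check Main.py and the data folder for the actual ones.

import os
from data.data_preprocess_1 import data_preprocess1  # assumed import path
from data.data_preprocess_2 import data_preprocess2  # assumed import path

if not os.path.exists("data/preprocessed_1.csv"):  # assumed output for the 1st time point
    data_preprocess1()
if not os.path.exists("data/preprocessed_2.csv"):  # assumed output for the 2nd time point
    data_preprocess2()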

After running Main.py, five CSV files will appear in the RecGFI/models folder. They contain all the tables for RQ1 in the paper.

As for RQ2, you can check the statistics of issue features in our dataset in RecGFI/data/Statistics.png. After running Main.py, the word clouds of issues are saved as RecGFI/wordcloud0.png and RecGFI/wordcloud1.png. You can also run RecGFI/intepretation/Run_lime.py, which generates the data for drawing the RQ2 figures.

cd intepretation
python Run_lime.py

The whole script can take up to one day to finish and may generate some warning messages during the run. After running Run_lime.py, several *.mat files will be generated in RecGFI/intepretation/data. These files are already provided in our git repository. The *.m files in RecGFI/intepretation/draw_figs can be executed with Matlab 2020b or a higher version to draw the figures. However, Matlab is proprietary software, and according to the requirements for the "Reusable" badge, "Proprietary artifacts need not be included". Therefore, we do not intend to claim badges for the Matlab part of our replication package.
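If Matlab is not available, the generated .mat files can still be inspected from Python with SciPy (a minimal sketch, assuming SciPy is installed in the environment; it only lists the stored variables and their shapes and does not reproduce the figures).

from scipy.io import loadmat

data = loadmat("intepretation/data/ftvalue.mat")  # data behind Figure 4
for name, value in data.items():
    if not name.startswith("__"):  # skip loadmat's metadata entries
        print(name, getattr(value, "shape", type(value)))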

As for RQ3, we save the status of the involved issues in real_world_evaluation/prediction_real_world_issues.csv.
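The file can be inspected directly, for example with pandas (a convenience snippet, not part of the original scripts; it makes no assumption about the column names).

import pandas as pd

df = pd.read_csv("real_world_evaluation/prediction_real_world_issues.csv")
print(df.columns.tolist())  # see which fields were recorded for each reported issue
print(df.head())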