Welcome to the BigIssue dataset GitHub repository! This is the official code repository for the BigIssue dataset and paper (paper link available soon).
The relevant data can be browsed here once authenticated with Google Cloud, and downloaded from https://storage.googleapis.com/bigissue-research. A valid Google Cloud account is required to browse the data.
To download the full dataset, one must have the `gsutil` command-line tool installed and configured. Installation instructions can be found here. Datasets can then be retrieved as follows:

```
gsutil -m cp -r gs://bigissue-research/<name of dataset>/ .
```
There are three versions of the dataset:
- `synthetic_small` (2048 tokens, 128 lines)
- `synthetic_large` (8192 tokens, 512 lines)
- Realistic dataset (non-tokenized)
  - `realistic/single_file` (issues with changes to one Java file)
  - `realistic/multi_file` (issues with changes to multiple Java files)
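For example, `gsutil -m cp -r gs://bigissue-research/synthetic_small/ .` downloads the small synthetic version into the current directory.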
Synthetic data consists of pieces of code with infilled samples as described in the paper. Each sample is a TFRecord that is a concatenation of the tokenized code snippet and the label vector. We use `RobertaTokenizer` for tokenization. Labels form a vector of length 128 (`synthetic_small`) or 512 (`synthetic_large`), where each line is marked as 1 (buggy), 0 (not buggy), or -1 (padded).
The synthetic data is already divided into train, validation, and test splits.
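The exact serialization is defined by the code in the `/training` directory. As a minimal sketch of decoding a `synthetic_small` record, assuming each record is a `tf.train.Example` holding the concatenated token ids and labels as a single int64 feature (the key `input` and the file name below are hypothetical):

```python
import tensorflow as tf

SEQ_LEN = 2048   # tokens per sample in synthetic_small
NUM_LINES = 128  # label vector length in synthetic_small

def parse_example(serialized):
    # ASSUMPTION: the feature key "input" and the flat int64 layout are
    # illustrative guesses; check /training for the actual schema.
    features = tf.io.parse_single_example(
        serialized,
        {"input": tf.io.FixedLenFeature([SEQ_LEN + NUM_LINES], tf.int64)},
    )
    concat = features["input"]
    tokens = concat[:SEQ_LEN]   # RobertaTokenizer token ids
    labels = concat[SEQ_LEN:]   # 1 = buggy, 0 = not buggy, -1 = padded
    return tokens, labels

# "train.tfrecord" is a placeholder file name
dataset = tf.data.TFRecordDataset("synthetic_small/train.tfrecord")
dataset = dataset.map(parse_example)
```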
Realistic data consists of information for all of the issues we've collected, as described in the paper. The directory structure is as follows:
```
realistic
├── single_file
│   └── <repository_user>-<repository_name>
│       └── <issue number>
│           ├── fixed.tar.gz    - fixed version of the repository
│           ├── unfixed.tar.gz  - unfixed version of the repository
│           ├── diff            - diff between unfixed and fixed versions
│           └── issue.jsonl     - issue metadata
└── multi_file
    └── (similar directory structure)
```
We provide the `fixed.tar.gz` and `unfixed.tar.gz` states of the repository, as well as the `diff` file containing the changes between them and `issue.jsonl` with information about the issue.
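As an illustrative sketch of working with a single downloaded issue directory (the field names inside `issue.jsonl` are not documented here, so the code only inspects what it finds):

```python
import json
import tarfile
from pathlib import Path

# Placeholder path; substitute a real repository and issue number.
issue_dir = Path("realistic/single_file/<repository_user>-<repository_name>/<issue number>")

# issue.jsonl holds one JSON object per line with the issue metadata
with open(issue_dir / "issue.jsonl") as f:
    for line in f:
        print(json.loads(line).keys())  # inspect the available fields

# Unpack the unfixed (buggy) state of the repository
with tarfile.open(issue_dir / "unfixed.tar.gz") as tar:
    tar.extractall(issue_dir / "unfixed")

# The diff records the change that took the repository from unfixed to fixed
print((issue_dir / "diff").read_text()[:500])
```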
Training code is provided in the `/training` directory. The code is written for Python 3.8 and higher. All pip requirements are listed in `requirements.txt`.
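The dependencies can be installed with `pip install -r requirements.txt`.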
Checkpoints are located in the Google Cloud Storage bucket at https://storage.googleapis.com/bigissue-research/checkpoints. One can download them with a command such as:

```
wget https://storage.googleapis.com/bigissue-research/checkpoints/pooling_real_data/model
```
An example of loading a model from a checkpoint is provided in `examples/example_model_loading.py`.
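To quickly peek at a downloaded checkpoint before wiring up the full model, a rough sketch, assuming the file is a PyTorch checkpoint (`examples/example_model_loading.py` remains the authoritative reference):

```python
import torch

# ASSUMPTION: the checkpoint is a PyTorch file; defer to
# examples/example_model_loading.py for the supported loading path.
state = torch.load("model", map_location="cpu")
print(type(state))
if isinstance(state, dict):
    print(list(state)[:10])  # peek at the first few keys
```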
We will add a citation once the paper is published on arXiv.
If you have any feedback, please either create an Issue here on GitHub or send an email to pkassianik@salesforce.com.