Learning Real Bug Detectors
This is the official repository for the paper: On Distribution Shift in Learning-based Bug Detectors.
Setup
The code requires python3
(we use python3.9
) and some Python packages that can be installed via pip install -r requirements.txt
. Make sure to add this repository to PYTHONPATH
.
Downloading Datasets and Models
We provide the following resources for download:
- Our datasets: link.
- Our fine-tuned models: link.
- Pretrained models (converted from CuBERT, including the tokenizer vocabulary): link.
After downloading and decompressing the above files, the directory structure should be organized as follows:
└──learning-real-bug-detector
│
└──dataset
│
└──fine-tuned
│
└──pretrained
Running the Code
You can run the code via the scripts under the scripts/
directory.
Evaluation and Fine-tuning
Evaluation can be done with the command below, where TASK_NAME
is the bug type (var-misuse
, wrong-binary-operator
, or argument-swap
). MODEL_NAME is the name of the model (e.g., model
if you use our fine-tuned models). Optionally, you can use the --probs_file
to store the prediction results and use calculate_ap.py
to compute average precision.
(scripts/) $ python eval.py --task TASK_NAME --model MODEL_NAME
Fine-tuning can be done with the command below, where DATASET_NAME
can be real
, synthetic
, or contrastive
. The paper describes a two-phase training scheme, first with --dataset contrastive
and then with --dataset real
(use --pretrained
to continue from the previous checkpoint). Other fine-tuning parameters are defaulted to be the best parameters in the paper evaluation.
(scripts/) $ python fine-tune.py --task TASK_NAME --model MODEL_NAME --dataset DATASET_NAME
Constructing Datasets from Scratch
If you are interested in constructing the datasets from scratch, you need to clone eth_py150_open, download py150_files, and install near-duplicate-code-detector. For var-misuse
and wrong-binary-operator
, the datasets constructed from eth_py150
repositories have a sufficient amount of real bugs. For argument-swap
, more repositories are needed to produce enough real bugs. The directory structure should be organized as follows:
└──learning-real-bug-detector
│
└──data
│
└──near-duplicate-code-detector
│
└──eth_py150_open
│
└──py150_files
│
└──data
Then run the following commands:
(scripts/data_gen_real/) $ python clone_repos.py --in_file all_py150_repos.txt
(scripts/data_gen_real/) $ ./run_real_bugs_from_repo.sh TASK_NAME
(scripts/data_gen_synthetic/) $ python gen_jsontxt.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python clean_jsontxt.py --task TASK_NAME
(scripts/data_gen_real/) $ python split_real.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python filter_train_data.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python gen_synthetic_train_data.py --task TASK_NAME
(scripts/data_gen_synthetic/) $ python gen_synthetic_train_data.py --task TASK_NAME --contrastive