This task uses CodeBERT to embed source code and predict whether it contains vulnerabilities.
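As a point of reference, here is a minimal sketch of how CodeBERT turns a code snippet into an embedding with the Hugging Face `transformers` library (the snippet and the first-token pooling are illustrative only; the actual training pipeline lives in `code/run.py`):

```python
# Minimal sketch: embed one code snippet with CodeBERT.
# The example function and the pooling choice are illustrative, not the run.py pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "int add(int a, int b) { return a + b; }"  # toy C function
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=400)
with torch.no_grad():
    outputs = model(**inputs)

embedding = outputs.last_hidden_state[:, 0, :]  # <s> token embedding, shape (1, 768)
```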
- Local: Download the dataset from the website into the `dataset` folder, or run the following commands.
```shell
cd dataset
pip install gdown
gdown https://drive.google.com/uc?id=1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF
python preprocess.py
cd ..
```
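After preprocessing, each split is a JSON Lines file. A quick way to inspect one example (the `func` and `target` field names and the path relative to the repository root are assumptions based on the CodeXGLUE defect-detection format):

```python
# Sketch: peek at the first preprocessed training example.
# Assumes each line of train.jsonl is a JSON object with a "func" field (C source
# of one function) and a binary "target" label -- verify against preprocess.py.
import json

with open("dataset/train.jsonl") as f:
    example = json.loads(f.readline())

print(example["target"])      # assumed encoding: 1 = vulnerable, 0 = not vulnerable
print(example["func"][:200])  # first 200 characters of the function source
```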
- Colab: Clone the repository, install the dependencies, and preprocess the dataset as above.
```shell
git clone https://github.com/LovelyBuggies/Defect-Detection.git
pip3 install transformers
cd Defect-Detection/
cd code
```
Fine-tune CodeBERT on the training set and evaluate it on the dev and test sets:

```shell
python run.py \
    --output_dir=./saved_models \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../dataset/train.jsonl \
    --eval_data_file=../dataset/valid.jsonl \
    --test_data_file=../dataset/test.jsonl \
    --epoch 5 \
    --block_size 400 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456 2>&1 | tee train.log
```
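The command above fine-tunes a 2-way classification head on top of the CodeBERT encoder. A rough sketch of that architecture (`run.py` defines its own model wrapper and training loop; this snippet only illustrates the idea, with a freshly initialized, untrained head):

```python
# Sketch: CodeBERT (a RoBERTa encoder) plus a binary classification head.
# This is NOT the exact model class trained by run.py; the head here is untrained.
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)

code = "void copy(char *dst, const char *src) { strcpy(dst, src); }"  # toy snippet
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=400)
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.softmax(dim=-1))  # class probabilities from the (untrained) head
```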
Then score the predictions on the test set with the provided evaluator:

```shell
python ../evaluator/evaluator.py -a ../dataset/test.jsonl -p saved_models/predictions.txt
{'Acc': 0.6325106295754027}
```
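The reported metric is plain accuracy over the test set. A hedged re-implementation of what the evaluator computes (run from `code/`; the `idx<TAB>prediction` line format of `predictions.txt` and the `idx`/`target` fields are assumptions based on the CodeXGLUE evaluator, see `evaluator/evaluator.py`):

```python
# Sketch: accuracy of predictions.txt against the "target" labels in test.jsonl.
# File formats here are assumptions; check evaluator/evaluator.py for the real logic.
import json

answers = {}
with open("../dataset/test.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        answers[int(obj["idx"])] = int(obj["target"])

correct = total = 0
with open("saved_models/predictions.txt") as f:
    for line in f:
        idx, pred = line.split()
        correct += int(answers[int(idx)] == int(pred))
        total += 1

print({"Acc": correct / total})
```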
The results on the test set are shown below:
| Method | Acc (%) |
| --- | --- |
| BiLSTM | 59.37 |
| TextCNN | 60.69 |
| RoBERTa | 61.05 |
| CodeBERT | 63.25 |