[SANER 2023] MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation

Implementation of the SANER 2023 paper "MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation" [arxiv].

This project is built on top of Project_CodeNet. Please refer to that project for more details.

Introduction

MixCode aims to supplement valid training data without manually collecting or labeling new code. It is inspired by Mixup, a data augmentation technique from computer vision. Specifically, MixCode 1) first applies multiple code refactoring methods to generate transformed code that keeps the same labels as the original data, and 2) adapts the Mixup technique to linearly mix the original code with the transformed code to augment the training data.
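The linear mixing step can be illustrated with a minimal sketch of the standard Mixup formula. It assumes the code has already been converted to numeric feature vectors. The function name `mixup`, the argument names, and the default `alpha` are illustrative assumptions and are not taken from this repository.

```python
import random


def mixup(x_orig, x_trans, y_orig, y_trans, alpha=1.0, rng=random):
    """Linearly mix two feature vectors and their one-hot labels.

    x_orig / x_trans: equal-length lists of floats (hypothetical feature
        vectors of the original and the refactored code).
    y_orig / y_trans: equal-length one-hot label lists.
    alpha: Beta-distribution parameter (assumed value, not from the README).
    """
    # Sample the mixing ratio lambda from Beta(alpha, alpha); it lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Apply the same convex combination to the inputs and to the labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_orig, x_trans)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_orig, y_trans)]
    return x_mix, y_mix, lam


# Toy usage: the refactored sample keeps the label of the original sample.
random.seed(0)
x_mix, y_mix, lam = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0])
```

When the transformed code keeps the label of the original, as in this toy call, the mixed label equals the original label and only the input is interpolated.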

Requirements

On Ubuntu:

  • Task: Classification
      • Python (>=3.6)
      • TensorFlow (version 2.3.0)
      • Keras (version 2.4.3)
      • CUDA 10.1
      • cuDNN (>=7.6)
  • Task: Bug Detection
      • Python (>=3.6)
      • PyTorch (version 1.6.0)
      • CUDA 10.1
      • cuDNN (>=7.6)

CodeBERT/GraphCodeBERT for Classification Tasks

  • pip install torch==1.4.0
  • pip install transformers==2.5.0
  • pip install filelock

Fine-Tune

cd CodeBERT

# --num_labels is the number of classes
python run.py \
    --output_dir=./saved_models \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --do_train \
    --num_train_epochs 50 \
    --block_size 256 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --num_labels 250 \
    --seed 123456  2>&1 | tee train.log

cd GraphCodeBERT

# --num_labels is the number of classes
python run.py \
    --tokenizer_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --config_name microsoft/graphcodebert-base \
    --do_train \
    --num_train_epochs 50 \
    --code_length 384 \
    --data_flow_length 384 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --num_labels 250 \
    --seed 123456  2>&1 | tee train.log
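For scripted or repeated runs, the fine-tuning command can also be assembled from a dictionary of flags. The following is a minimal sketch: `build_finetune_command` is a hypothetical helper that is not part of this repository, and the flag values are copied from the CodeBERT command in this README.

```python
import shlex


def build_finetune_command(flags, script="run.py"):
    """Assemble an argv list for `python run.py` from a flag dictionary.

    A value of True emits a bare switch (e.g. --do_train); any other value
    is emitted as `--name value`.
    """
    argv = ["python", script]
    for name, value in flags.items():
        argv.append("--" + name)
        if value is not True:
            argv.append(str(value))
    return argv


# Flags copied from the CodeBERT fine-tuning command in this README.
codebert_flags = {
    "output_dir": "./saved_models",
    "tokenizer_name": "microsoft/codebert-base",
    "model_name_or_path": "microsoft/codebert-base",
    "do_train": True,
    "num_train_epochs": 50,
    "block_size": 256,
    "train_batch_size": 8,
    "eval_batch_size": 16,
    "learning_rate": "2e-5",
    "max_grad_norm": 1.0,
    "num_labels": 250,  # number of classes
    "seed": 123456,
}

command = build_finetune_command(codebert_flags)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list could then be passed to `subprocess.run` from inside the `CodeBERT` directory. Changing `num_labels` in the dictionary is enough to adapt the run to a dataset with a different number of classes.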

Dataset

Citation

If you use the code in your research, please cite:

    @inproceedings{dong2023mixcode,
      title={MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation},
      author={Dong, Zeming and Hu, Qiang and Guo, Yuejun and Cordy, Maxime and Papadakis, Mike and Zhang, Zhenya and Le Traon, Yves and Zhao, Jianjun},
      booktitle={2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)},
      pages={379--390},
      year={2023},
      organization={IEEE}
    }