Implementation of the SANER 2023 paper "MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation" [arxiv].
Inspired by Mixup, a data augmentation technique from computer vision, MixCode aims to supplement valid training data without manually collecting or labeling new code. Specifically, it 1) applies multiple code refactoring methods to generate transformed code that keeps the same labels as the original data, and 2) adapts the Mixup technique to linearly mix the original code with the transformed code to augment the training data.
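The linear mixing in step 2 can be sketched as follows. This is a minimal illustration of generic Mixup on plain Python lists, not the repository's implementation: the function name `mixup`, the `alpha` default, and the use of one-hot label vectors are assumptions made for the example.

```python
import random


def mixup(x_orig, y_orig, x_trans, y_trans, alpha=0.2, rng=random):
    """Linearly mix two feature vectors and their one-hot label vectors.

    Hypothetical helper for illustration only. `alpha` is the Beta
    distribution parameter; 0.2 is an assumed value, not one taken
    from the paper or the repository.
    """
    # Draw the mixing ratio lambda from Beta(alpha, alpha); it lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Mix the features and the labels with the same ratio.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_orig, x_trans)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_orig, y_trans)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    rng = random.Random(123456)
    # Toy vectors standing in for an original and a refactored program.
    x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
    print(lam, x_mix, y_mix)
```

Here `x_orig` and `x_trans` stand for vector representations of an original program and a refactored one; in a real pipeline they would typically be embeddings produced by the model rather than hand-written lists.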
On Ubuntu:
- Task: Classification
  - Python (>=3.6)
  - TensorFlow (version 2.3.0)
  - Keras (version 2.4.3)
  - CUDA 10.1
  - cuDNN (>=7.6)
- Task: Bug Detection
  - Python (>=3.6)
  - PyTorch (version 1.6.0)
  - CUDA 10.1
  - cuDNN (>=7.6)
- pip install torch==1.4.0
- pip install transformers==2.5.0
- pip install filelock
cd CodeBERT
# --num_labels sets the number of classes
python run.py \
--output_dir=./saved_models \
--tokenizer_name=microsoft/codebert-base \
--model_name_or_path=microsoft/codebert-base \
--do_train \
--num_train_epochs 50 \
--block_size 256 \
--train_batch_size 8 \
--eval_batch_size 16 \
--learning_rate 2e-5 \
--max_grad_norm 1.0 \
--num_labels 250 \
--seed 123456 2>&1 | tee train.log
cd GraphCodeBERT
# --num_labels sets the number of classes
python run.py \
--tokenizer_name=microsoft/graphcodebert-base \
--model_name_or_path=microsoft/graphcodebert-base \
--config_name microsoft/graphcodebert-base \
--do_train \
--num_train_epochs 50 \
--code_length 384 \
--data_flow_length 384 \
--train_batch_size 8 \
--eval_batch_size 16 \
--learning_rate 2e-5 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--num_labels 250 \
--seed 123456 2>&1 | tee train.log
- Java250: https://developer.ibm.com/exchanges/data/all/project-codenet/
- Python800: https://developer.ibm.com/exchanges/data/all/project-codenet/
- Refactory: https://github.com/githubhuyang/refactory
- CodRep: https://github.com/KTH/CodRep-competition
If you use the code in your research, please cite:
@inproceedings{dong2023mixcode,
title={MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation},
author={Dong, Zeming and Hu, Qiang and Guo, Yuejun and Cordy, Maxime and Papadakis, Mike and Zhang, Zhenya and Le Traon, Yves and Zhao, Jianjun},
booktitle={2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)},
pages={379--390},
year={2023},
organization={IEEE}
}