NatGen


NatGen: Generative Pre-training by "Naturalizing" Source Code.

Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar T Devanbu, Baishakhi Ray. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’22), November 14-18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3540250.3549162.



Getting Started

Environment Requirements

pytorch==1.7.0
cudatoolkit=11.1
datasets==1.18.3
transformers==4.16.2
tensorboard==2.8.0
tree-sitter==0.19.0
nltk==3.6.7
scipy==1.5.4

To set up the environment, uncomment lines 35 and 36 of setup.sh (or run those commands in your shell), then run:

bash setup.sh

Download and preprocess the training data

cd scripts/pretraining;
bash process_data.sh

Data processing takes several parameters, which are passed through a JSON configuration file. The configuration file should be placed in the configs/pretraining/data_config directory.
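As a rough sketch only, such a configuration file might look like the example below. The field names here are hypothetical placeholders, not the repository's actual schema; consult the example configs shipped in configs/pretraining/data_config for the real keys.

{
    "data_dir": "data/pretraining/raw",
    "output_dir": "data/pretraining/processed",
    "languages": ["java", "python", "javascript", "go"],
    "max_source_length": 512,
    "num_workers": 16
}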

Pretrain the model

cd scripts/pretraining;
bash train.sh <EXPERIMENT_NAME> <GPUS>

Adjust per_device_train_batch_size and gradient_accumulation_steps in the training-arguments JSON file, together with the number of GPUs, to reach the desired effective batch size, which is per_device_train_batch_size * gradient_accumulation_steps * number of GPUs. We use distributed training to pre-train. An illustrative example follows.
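For instance (illustrative values only, not a recommended configuration), running on 8 GPUs with the following settings in the training-arguments JSON gives an effective batch size of 8 * 4 * 8 = 256:

{
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4
}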

We reused source code from the following open-source repositories:

  1. CodeT5
  2. Microsoft CodeXGLUE

Our sincere thanks to the authors of these repositories for open-sourcing their work.

Citation

If you use this repository, please cite:

@inproceedings{chakraborty2022natgen,
    author = {Chakraborty, Saikat and Ahmed, Toufique and Ding, Yangruibo and Devanbu, Premkumar T. and Ray, Baishakhi},
    title = {NatGen: Generative Pre-Training by “Naturalizing” Source Code},
    year = {2022},
    isbn = {9781450394130},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3540250.3549162},
    doi = {10.1145/3540250.3549162},
    booktitle = {Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
    pages = {18–30},
    numpages = {13},
    keywords = {Neural Network, Semantic Preserving Transformation, Source Code Transformer, Source Code Pre-training},
    location = {Singapore, Singapore},
    series = {ESEC/FSE 2022}
}