Official code release of our work, Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages.
Setup • Train • Evaluation • License • Citation
We recommend setting up a conda environment to run the experiments; we assume Anaconda is installed.
First, create and activate the environment:
conda create --name sgb python=3.8
conda activate sgb
The additional requirements (listed in requirements.txt) can be installed by running the following script:
bash install_env.sh
Then build the tree_sitter library for the Java and Python languages by running:
python build.py
Finally, download the pre-trained PLBART checkpoints.
cd plbart
bash download.sh
PLBART comes in two sizes, so experiments can be run with MODEL_SIZE=base|large.
Download and prepare the data by running:
cd data/sumgen
bash download.sh
bash prepare.sh
To train the summarization and generation models, run:
cd sumgen
bash run.sh GPU_ID [MODEL_SIZE]
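For example, to train on GPU 0 with the base model (these argument values are illustrative, matching the evaluation examples below):
bash run.sh 0 base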
TBD
To train the translation model (PLBART) via back-translation, run:
cd plbart
bash train.sh GPU_ID [MODEL_SIZE]
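For example, with GPU 0 and the base model (illustrative values):
bash train.sh 0 base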
To decode and evaluate a trained model, run:
cd sumgen/evaluation
bash decode.sh GPU_ID SOURCE TARGET MODEL_SIZE BEAM_SIZE
bash evaluate.sh SAVE_DIR SOURCE TARGET
For example, run the following commands to get results with default settings.
cd sumgen/evaluation
# to evaluate base model
bash decode.sh 0 java python base 10
bash evaluate.sh base_java_python_b10 java python
# to evaluate large model
bash decode.sh 0 java python large 10
bash evaluate.sh large_java_python_b10 java python
To evaluate the translation models trained via back-translation, run:
cd scripts
bash run.sh GPU_ID
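For example, on GPU 0:
bash run.sh 0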
The contents of this repository are released under the MIT license, which also applies to the pre-trained and fine-tuned models.
If you use any of the datasets, models or code modules, please cite the following paper:
@article{ahmad2022sumgen,
  author     = {Wasi Uddin Ahmad and Saikat Chakraborty and Baishakhi Ray and Kai-Wei Chang},
  title      = {Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages},
  journal    = {CoRR},
  volume     = {abs/2205.11116},
  year       = {2022},
  url        = {https://arxiv.org/abs/2205.11116},
  eprinttype = {arXiv},
  eprint     = {2205.11116}
}