Official code of our work, Summarize and Generate to Back-Translate: Unsupervised Translation of Programming Languages [arXiv].

Summarize and Generate to Back-translate

Setup

We recommend running the experiments in a conda environment. The steps below assume Anaconda is already installed.

First, create and activate the environment:

conda create --name sgb python=3.8
conda activate sgb

The additional requirements (noted in requirements.txt) can be installed by running the following script:

bash install_env.sh

Then build the tree_sitter library for Java and Python by running:

python build.py

Finally, download the pre-trained PLBART checkpoints.

cd plbart
bash download.sh

Two model sizes are available, so experiments can be run with MODEL_SIZE set to either base or large.
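Before launching a long run, it can help to confirm that the sgb environment is active and that MODEL_SIZE holds one of the two supported values. The following is a minimal sketch, not part of this repository: the function names sgb_env_active and sgb_check_model_size are hypothetical, and the check relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated.

```shell
# Hypothetical preflight helpers; not shipped with this repository.

# Succeeds when the active conda environment matches the expected name.
# Assumption: `conda activate` has exported CONDA_DEFAULT_ENV.
sgb_env_active() {
  [ "${CONDA_DEFAULT_ENV:-}" = "${1:-sgb}" ]
}

# Succeeds only for the two model sizes named in this README.
sgb_check_model_size() {
  case "$1" in
    base|large) return 0 ;;
    *) echo "MODEL_SIZE must be 'base' or 'large', got: '$1'" >&2
       return 1 ;;
  esac
}
```

A typical guard would be `sgb_env_active && sgb_check_model_size "$MODEL_SIZE"` placed ahead of any training or evaluation command.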

Train

Step 1. Summarization and Generation

Download data

cd data/sumgen
bash download.sh
bash prepare.sh

Training SG model

cd sumgen
bash run.sh GPU_ID [MODEL_SIZE]

Step 2. Back-translation

Download data

TBD

Training BT model

cd plbart
bash train.sh GPU_ID [MODEL_SIZE]
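The two training steps can be chained from one place. The sketch below only prints the commands so they can be reviewed first. It assumes that every `cd` in this README is relative to the repository root and that MODEL_SIZE may be omitted, as the square brackets suggest. The name sgb_train_plan is hypothetical, and Step 2 still needs its data, which is listed as TBD above.

```shell
# Hypothetical helper: print the training commands for both steps, in order.
# Assumption: sumgen/ and plbart/ sit directly under the repository root, so
# each step is wrapped in a subshell that leaves the caller's directory unchanged.
sgb_train_plan() {
  local gpu_id=$1 model_size=${2:-}
  # ${model_size:+ $model_size} appends the size only when one was given.
  echo "(cd sumgen && bash run.sh ${gpu_id}${model_size:+ $model_size})"
  echo "(cd plbart && bash train.sh ${gpu_id}${model_size:+ $model_size})"
}
```

Running `sgb_train_plan 0 large` from the repository root lists the Step 1 command followed by the Step 2 command. Leaving out the second argument drops MODEL_SIZE from both lines.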

Evaluation

Evaluate SumGen model

cd sumgen/evaluation
bash decode.sh GPU_ID SOURCE TARGET MODEL_SIZE BEAM_SIZE
bash evaluate.sh SAVE_DIR SOURCE TARGET

For example, run the following commands to get results with default settings.

cd sumgen/evaluation
# to evaluate base model
bash decode.sh 0 java python base 10
bash evaluate.sh base_java_python_b10 java python
# to evaluate large model
bash decode.sh 0 java python large 10
bash evaluate.sh large_java_python_b10 java python
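In both example runs, the SAVE_DIR passed to evaluate.sh follows the pattern MODEL_SIZE_SOURCE_TARGET_bBEAM_SIZE. Assuming that pattern holds for other settings (it is inferred from the examples, not confirmed against decode.sh), the directory name can be derived instead of typed by hand. The function names sgb_save_dir and sgb_eval_plan are hypothetical.

```shell
# Hypothetical helper: derive SAVE_DIR from the decode.sh arguments.
# Assumption: directories are named <MODEL_SIZE>_<SOURCE>_<TARGET>_b<BEAM_SIZE>,
# as in the examples above (e.g. base_java_python_b10).
sgb_save_dir() {
  local model_size=$1 src=$2 tgt=$3 beam_size=$4
  printf '%s_%s_%s_b%s\n' "$model_size" "$src" "$tgt" "$beam_size"
}

# Print (do not run) the decode/evaluate command pair for one setting.
# Argument order mirrors decode.sh: GPU_ID SOURCE TARGET MODEL_SIZE BEAM_SIZE.
sgb_eval_plan() {
  local gpu_id=$1 src=$2 tgt=$3 model_size=$4 beam_size=$5
  echo "bash decode.sh $gpu_id $src $tgt $model_size $beam_size"
  echo "bash evaluate.sh $(sgb_save_dir "$model_size" "$src" "$tgt" "$beam_size") $src $tgt"
}
```

With these definitions, `sgb_eval_plan 0 java python base 10` prints the same two commands as the base-model example above.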

Evaluate PLBART

cd scripts
bash run.sh GPU_ID

Results

License

The contents of this repository are released under the MIT license. The license applies to the pre-trained and fine-tuned models as well.

Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@article{ahmad2022sumgen,
  author    = {Wasi Uddin Ahmad and Saikat Chakraborty and Baishakhi Ray and Kai-Wei Chang},
  title     = {Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages},
  journal   = {CoRR},
  volume    = {abs/2205.11116},
  year      = {2022},
  url       = {https://arxiv.org/abs/2205.11116},
  eprinttype = {arXiv},
  eprint    = {2205.11116}
}