This repo provides the code to replicate the experiments in the paper:

Xinyun Chen, Linyuan Gong, Alvin Cheung, and Dawn Song, "PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context," ACL 2021.
Download the JuiCe dataset here.
The code runs with Python 3 and PyTorch 0.4.1.
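A minimal environment setup sketch, assuming a pip-installable build of this legacy PyTorch release is available for your platform; the environment name and Python minor version here are our choices, not requirements stated by the repo:

```
# Create an isolated environment; Python 3.6 is assumed for compatibility with PyTorch 0.4.1.
conda create -n plotcoder python=3.6
conda activate plotcoder

# Install the legacy PyTorch release the code was tested with.
pip install torch==0.4.1
```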
To extract the plot generation samples from the entire JuiCe dataset, run `plot_sample_extraction.py`.
Note that for model training and evaluation, we may further filter out some samples from the datasets extracted here. However, we keep these copies, which include the maximal number of plot samples we can extract, so that we do not need to enumerate the entire JuiCe dataset again afterwards.
In the following we list some important arguments for data preprocessing (an example invocation follows the list):

- `--data_folder`: path to the directory that stores the data.
- `--prep_train_data_name`: filename of the plot generation samples for training. Note that it does not include all plot samples in the original JuiCe training split: some of them are merged into the hard dev/test splits. To build the training set, make sure that this filename is not `None`.
- `--prep_dev_data_name`, `--prep_test_data_name`: filenames of the plot generation samples filtered from the original dev/test splits of JuiCe. To preprocess each split of the data, make sure that the corresponding filename is not `None`.
- `--prep_dev_hard_data_name`, `--prep_test_hard_data_name`: filenames of the homework or exam solutions extracted from the original training split of JuiCe. These are larger-scale sets for evaluation.
- `--build_vocab`: set it to `True` to build the vocabularies of natural language words and code tokens.
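For example, a preprocessing run might look like the following sketch; the folder and file names are placeholders of our own choosing, not values mandated by the repo:

```
python plot_sample_extraction.py \
    --data_folder ./data \
    --prep_train_data_name train_plot.json \
    --prep_dev_data_name dev_plot.json \
    --prep_test_data_name test_plot.json \
    --prep_dev_hard_data_name dev_plot_hard.json \
    --prep_test_hard_data_name test_plot_hard.json \
    --build_vocab True
```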
- To run the hierarchical model:

```
python run.py --nl --use_comments --code_context --nl_code_linking --copy_mechanism --hierarchy --target_code_transform
```

- To run the non-hierarchical model with the copy mechanism:

```
python run.py --nl --use_comments --code_context --nl_code_linking --copy_mechanism --target_code_transform
```

- To run the LSTM decoder without the copy mechanism (i.e., one-hot encoding for data items) while preserving the NL correspondence in the input code sequence:

```
python run.py --nl --use_comments --code_context --nl_code_linking --target_code_transform
```

- To run the LSTM decoder with the standard copy mechanism, without preserving the NL correspondence:

```
python run.py --nl --use_comments --code_context --copy_mechanism --target_code_transform
```

- To run the LSTM decoder without the copy mechanism (i.e., one-hot encoding for data items, as in prior work):

```
python run.py --nl --use_comments --code_context --target_code_transform
```
In the following we list some important arguments for running the neural models (an example evaluation command follows the list):

- `--nl`: include the previous natural language cell as the model input. Note that the current code does not support including natural language from multiple cells, because NL instructions written for previous code cells, rather than the current one, may confuse the model.
- `--use_comments`: include the comments in the current code cell as the model input.
- `--code_context`: include the code context as the model input.
- `--target_code_transform`: standardize the target code sequence into a more canonical form.
- `--max_num_code_cells`: the number of code cells included as the code context. Default: `2`. Note that setting it to `0` is not equivalent to not using the code context, because the input still includes: (1) the code within the current code cell before the snippet that generates the plot; and (2) the code context that includes the data frames and their attributes.
- `--nl_code_linking`: if a code token appears in the NL, concatenate the code token embedding with the corresponding NL embedding.
- `--copy_mechanism`: use the copy mechanism for the decoder.
- `--hierarchy`: use the hierarchical decoder for code generation.
- `--load_model`: path to the trained model (not required when training from scratch).
- `--eval`: add this flag at test time, and remember to set `--load_model` for evaluation.
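For example, evaluating a trained hierarchical model might look like the following sketch; the checkpoint path is a placeholder of our own choosing, not a path produced by the repo:

```
python run.py --nl --use_comments --code_context --nl_code_linking --copy_mechanism \
    --hierarchy --target_code_transform \
    --eval --load_model ./checkpoints/plotcoder_hierarchy.ckpt
```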
If you use the code in this repo, please cite the following paper:
```
@inproceedings{chen-2021-plotcoder,
    title = {PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context},
    author = {Chen, Xinyun and Gong, Linyuan and Cheung, Alvin and Song, Dawn},
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    year = {2021}
}
```