This repository contains the dataset used in the paper COGS: A Compositional Generalization Challenge Based on Semantic Interpretation, accepted to EMNLP 2020.
The dataset can be found under `data/`. `dev` and `test` contain in-distribution examples, and `gen` contains the generalization examples discussed in our paper. We used the examples in `train` for our main experiment and `train_100` for the additional experiment in the appendix with an increased number of exposure examples.
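If you want to inspect the data directly, the minimal sketch below shows one way to load a split in Python. It assumes each row of a split is tab-separated into a source sentence, a target logical form, and a generalization-type label; check the actual files for the exact layout.

```python
# Minimal sketch for inspecting a COGS split. Assumes each row has three
# tab-separated fields (source sentence, target logical form, generalization
# type); the exact layout is an assumption, so check the files themselves.
def load_split(path):
    """Read a .tsv split into a list of (source, target, gen_type) tuples."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3:  # skip empty or malformed lines, if any
                examples.append(tuple(fields))
    return examples

if __name__ == "__main__":
    gen = load_split("data/gen.tsv")  # path assumes the default data/ layout
    print(len(gen), "generalization examples")
    print(gen[0])
```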
We used OpenNMT for our experiments. This repo contains the version of OpenNMT that we used. You can follow the instructions below to rerun our experiments.
First, reformat the `.tsv`-format dataset into the format used by the OpenNMT preprocessing code:

```
python reformat_data_for_opennmt.py --input_path path_to_data --output_path output_path
```
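For reference, this step turns each tab-separated split into the parallel source/target text files that OpenNMT's preprocessing expects. The sketch below only illustrates the general idea; the output filenames are hypothetical, so use `reformat_data_for_opennmt.py` itself for actual runs.

```python
# Illustrative sketch of the kind of conversion this step performs: splitting a
# tab-separated COGS file into parallel source/target text files, the input
# format OpenNMT preprocessing expects. The output filenames below are
# hypothetical; the actual script may produce differently named files.
import os

def tsv_to_parallel(tsv_path, out_dir, split_name):
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, split_name + "_source.txt")  # hypothetical name
    tgt_path = os.path.join(out_dir, split_name + "_target.txt")  # hypothetical name
    with open(tsv_path, encoding="utf-8") as fin, \
         open(src_path, "w", encoding="utf-8") as fsrc, \
         open(tgt_path, "w", encoding="utf-8") as ftgt:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            fsrc.write(fields[0] + "\n")  # source sentence
            ftgt.write(fields[1] + "\n")  # target logical form

if __name__ == "__main__":
    tsv_to_parallel("data/train.tsv", "reformatted_data", "train")
```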
Second, run the OpenNMT preprocessing step. `output_path` should be the same as the output path of the previous step, and `opennmt_path` should point to the `src/OpenNMT` directory in this repo.

```
bash opennmt_preprocess.sh output_path opennmt_path
```
Third, see the scripts named `run_x.sh` under `scripts/`, where `x` is one of `transformer`, `lstm_uni`, or `lstm_bi`. Run them with the same command-line arguments as the preprocessing step:

```
bash run_transformer.sh output_path opennmt_path
```
Please refer to the individual script files for the hyperparameter settings for each model and random seed.
After running a script, the model's predictions for the in-distribution dev/test sets and the out-of-distribution generalization set will be saved under `$OPENNMT_PATH/pred/` as `.tsv` files. Each is a tab-delimited file with 3 columns: source, target, model_prediction.
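For example, a simple way to score such a prediction file is exact-match accuracy over the target and model_prediction columns. The sketch below assumes the file has no header row; the path passed on the command line is only an example.

```python
# Minimal sketch for scoring a prediction file by exact-match accuracy,
# assuming tab-separated rows of (source, target, model_prediction) and no
# header row; the path shown in the usage comment below is only an example.
import sys

def exact_match_accuracy(pred_tsv):
    total = correct = 0
    with open(pred_tsv, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # skip empty or malformed lines
            _, target, prediction = fields
            total += 1
            correct += int(target.strip() == prediction.strip())
    return correct / total if total else 0.0

if __name__ == "__main__":
    # e.g. python exact_match.py $OPENNMT_PATH/pred/transformer_gen.tsv (example path)
    print("Exact-match accuracy: {:.4f}".format(exact_match_accuracy(sys.argv[1])))
```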