This repository contains the code accompanying our NLP project.

The src directory contains the code we used to build the T5 semantic parser, as well as the code for translating the data.

  • We attempted to reproduce the Dangle approach, but we were unable to get it working; Dangle's own code for running RoBERTa does not work either.
  • Baseline.py and Baseline_Multilingual.py implement our standard T5 semantic parser and a multilingual variant.
  • Example command lines for using this code are given below:

Training the monolingual model:

python Baseline.py train --cuda --train_dir ../data/train.tsv --val_dir ../data/dev.tsv --test_dir ../data/test.tsv --T5_modelname t5-base --save_dir ../models/baseline/ --epochs 200 --batch_size 64 --lr 0.0002
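
For reference, a fine-tuning step for a T5 seq2seq semantic parser in HuggingFace Transformers looks roughly like the sketch below. This is a hedged illustration of the general recipe, not the exact contents of Baseline.py; in particular, the TSV column layout (source sentence and target logical form separated by a tab) is an assumption.

```python
# Minimal sketch of T5 fine-tuning for semantic parsing (assumption: Baseline.py
# may differ). Assumes a TSV whose first two columns are "input<TAB>target".
import csv
import torch
from torch.utils.data import DataLoader
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

def load_tsv(path):
    with open(path) as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

train_pairs = load_tsv("../data/train.tsv")

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=64, shuffle=True, collate_fn=collate)

# One epoch shown; the command above runs for 200 epochs.
model.train()
for batch in loader:
    batch = {k: v.cuda() for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```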

Evaluating the monolingual model:

python Baseline.py evaluate --cuda --test_dir ../data/gen.tsv --checkpoint_dir ../models/baseline/ --batch_size 64
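
Evaluation for this kind of parser is typically exact-match accuracy on the generated logical forms. The sketch below (continuing from the training sketch above and reusing its model, tokenizer, and load_tsv) shows the general idea; it is an assumption, not necessarily what the evaluate mode in Baseline.py computes.

```python
# Sketch of exact-match evaluation on a held-out TSV split.
model.eval()
pairs = load_tsv("../data/gen.tsv")
correct = 0
with torch.no_grad():
    for source, target in pairs:
        inputs = tokenizer(source, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_length=512)
        prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        correct += int(prediction.strip() == target.strip())
print(f"Exact-match accuracy: {correct / len(pairs):.3f}")
```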

Training the multilingual model:

python Baseline_Multilingual.py train --cuda --train_dir ../data/train_translated_trimmed.tsv --val_dir ../data/dev_translated_trimmed.tsv --test_dir ../data/test_translated_trimmed.tsv --T5_modelname google/mt5-small --save_dir ../models/baseline-translated/ --epochs 200 --batch_size 16 --lr 0.00002
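
The multilingual variant mainly swaps in an mT5 checkpoint; a hedged sketch of that swap is shown below (google/mt5-small is the checkpoint named in the command above, but the rest of Baseline_Multilingual.py is assumed to follow the monolingual sketch).

```python
# Sketch: load an mT5 checkpoint instead of t5-base (assumption: the real
# script may differ in other details as well).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small").cuda()
# The command above uses a smaller batch size (16) and learning rate (2e-5)
# than the monolingual run.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```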

Evaluating the multilingual model:

python Baseline_Multilingual.py evaluate --cuda --test_dir ../data/gen_translated.tsv --checkpoint_dir ../models/baseline-translated/ --batch_size 32

Training our T5 Dangle variant (buggy/broken):

python Dangle.py train --cuda --train_dir ../data/train.tsv --val_dir ../data/dev.tsv --test_dir ../data/test.tsv --T5_modelname t5-base --save_dir ../models/dangle/ --epochs 200 --batch_size 64 --lr 0.0002
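
All three scripts share the flag set shown in the commands above. A minimal sketch of how such a command-line interface could be wired up with argparse follows; the actual argument-parsing code in the scripts is assumed, not copied.

```python
# Sketch of an argparse interface matching the flags used in the commands
# above (assumption: the real scripts may define their arguments differently).
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="T5 semantic parser")
    parser.add_argument("mode", choices=["train", "evaluate"])
    parser.add_argument("--cuda", action="store_true")
    parser.add_argument("--train_dir")
    parser.add_argument("--val_dir")
    parser.add_argument("--test_dir")
    parser.add_argument("--T5_modelname", default="t5-base")
    parser.add_argument("--save_dir")
    parser.add_argument("--checkpoint_dir")
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=2e-4)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```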

The data directory contains the data on which our parsers were trained and evaluated.

  • The translated datasets are very large because they contain a translation of every sample in every language.
  • The "trimmed" translated datasets contain only one language per sample, with the languages distributed evenly across samples; a sketch of this trimming step is given below.