Steps to recreate English TQ corpus annotations:
Download OpenSubtitles raw data:
- raw data for the language pair (separate files) from OPUS, i.e. the untokenised corpus files
- the sentence alignment file for the language pair, named alignments.src-trg.xml.gz (to be placed alongside the 'raw' directory)
    OpenSubtitles2016/
    |
    +---- raw/
    |     |
    |     +---- en-fr/
    |     |
    |     +---- ...
    |
    +---- alignments.en-fr.xml.gz
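The layout above can be checked before going further. The following is a minimal sketch; the function name check_opensubs_layout is hypothetical and not part of the repository. It only tests that the raw directory and the alignment file sit side by side.

```shell
# Hypothetical helper (not part of the repository): check that the
# download matches the directory layout described above.
check_opensubs_layout() {
    root=$1   # e.g. /path/to/OpenSubtitles2016
    pair=$2   # e.g. en-fr
    if [ ! -d "$root/raw" ]; then
        echo "missing directory: $root/raw" >&2
        return 1
    fi
    if [ ! -f "$root/alignments.$pair.xml.gz" ]; then
        echo "missing file: $root/alignments.$pair.xml.gz" >&2
        return 1
    fi
    echo "layout ok for $pair"
}
# Example call (the path is a placeholder):
#   check_opensubs_layout /path/to/OpenSubtitles2016 en-fr
```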
Install pre-processing tools
Change the paths in datasets/generic-makefile to correspond to your own paths
- OSDIR=/path/to/OpenSubtitles2016/raw
- TRUECASINGDIR=/path/to/mosesdecoder/scripts/recaser
- MAINDIR=/path/to/tag-questions-opensubs
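If you prefer not to edit the file by hand, the three variables can be rewritten with sed. This is a sketch under the assumption that each variable is defined in datasets/generic-makefile at the start of a line as NAME=value; the helper name set_make_var is hypothetical.

```shell
# Hypothetical helper: rewrite one NAME=value line in a makefile.
# Assumes the variable is defined at the start of a line as NAME=value
# and that the new value contains neither '|' nor '&'.
set_make_var() {
    file=$1
    name=$2
    value=$3
    # Write to a temporary file first so the original is only replaced
    # when sed succeeds.
    sed "s|^$name=.*|$name=$value|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
# Example calls (the paths are placeholders):
#   set_make_var datasets/generic-makefile OSDIR /path/to/OpenSubtitles2016/raw
#   set_make_var datasets/generic-makefile TRUECASINGDIR /path/to/mosesdecoder/scripts/recaser
```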
Prepare true-casing data
- concatenate Europarl and Ted-talks data (full monolingual datasets) where available for the language
- replace by space
- tokenise with MElt:
MElt -l {en, fr, de, cs} -t -x -no_s -M -K
- change the path in the language-specific Makefile to the pre-processed data used for truecasing
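The concatenation and tokenisation steps above can be sketched as a single function. It assumes that MElt reads text on standard input and writes the tokenised text to standard output; the function name, the TOKENISE_CMD override and the file names are hypothetical. The character replacement step is not included.

```shell
# Hypothetical helper: concatenate the monolingual corpora for one
# language and tokenise the result into a single file.
# Assumption: MElt reads stdin and writes stdout.
# TOKENISE_CMD may be set to substitute another command (e.g. for a dry run).
build_truecasing_data() {
    lang=$1   # one of en, fr, de, cs
    out=$2    # single output file to be used for truecasing
    shift 2   # remaining arguments: input corpus files
    if [ -n "${TOKENISE_CMD:-}" ]; then
        cat "$@" | $TOKENISE_CMD > "$out"
    else
        cat "$@" | MElt -l "$lang" -t -x -no_s -M -K > "$out"
    fi
}
# Example call (the file names are placeholders):
#   build_truecasing_data en truecasing.en europarl.en ted.en
```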
Extract parallel corpus
cd datasets/langpair    (e.g. de-en, fr-en, cs-en)
make extract
Pre-process data
cd datasets/langpair    (e.g. de-en, fr-en, cs-en)
- Change the path to the truecasing data in the language-specific Makefile. The truecasing data must be a single file for each language, tokenised (with MElt) and cleaned using the script.
- Preprocessing (cleaning, blank line removal, tokenisation, truecasing and division into sets):
make preprocess
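To run extraction and pre-processing for several language pairs in one go, a wrapper along the following lines may help. It is a sketch only: the function name and the DRY_RUN switch are hypothetical, and it assumes the extract and preprocess targets are run from datasets/langpair as described above.

```shell
# Hypothetical wrapper: run "make extract" then "make preprocess" for
# each language pair. With DRY_RUN=1 the commands are printed instead
# of being executed.
run_extract_and_preprocess() {
    root=$1   # directory that contains datasets/
    shift     # remaining arguments: language pairs, e.g. de-en fr-en cs-en
    for pair in "$@"; do
        for target in extract preprocess; do
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "cd $root/datasets/$pair && make $target"
            else
                # Subshell keeps the directory change local to this step.
                (cd "$root/datasets/$pair" && make "$target") || return 1
            fi
        done
    done
}
# Example call (the path is a placeholder):
#   DRY_RUN=1 run_extract_and_preprocess /path/to/tag-questions-opensubs de-en fr-en cs-en
```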
cd subcorpora/langpair
make annotate    (to get the line numbers of each type of tag question)
make getsentences    (to extract the sentences corresponding to the line numbers)
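To illustrate how line numbers map back to sentences, the sketch below pulls the lines listed in a line-number file out of a corpus file. It is a generic awk idiom, not the code behind make getsentences; the function name and file names are hypothetical, and the line-number file is assumed to hold one 1-based number per line.

```shell
# Hypothetical helper: print the lines of a corpus file whose
# (1-based) line numbers are listed in another file, in corpus order.
extract_lines() {
    numbers=$1   # file with one line number per line
    corpus=$2    # file with one sentence per line
    # First pass (NR == FNR) stores the wanted line numbers;
    # second pass prints the corpus lines whose number was stored.
    awk 'NR == FNR { want[$1] = 1; next } (FNR in want)' "$numbers" "$corpus"
}
# Example call (the file names are placeholders):
#   extract_lines tagquestion.linenumbers corpus.en > tagquestion.sentences
```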
- Store all translations in translations/langpair and give them the name testset.translated.{cs,de}-en, where testset is trainsmall, devsmall or testsmall
- Czech and German to English translation (Nematus):
  * Download the Czech and German to English systems from here (WMT'16 UEdin submissions - Sennrich et al., 2016)
  * Decode the trainsmall, devsmall and testsmall sets using the translation scripts provided via the link just above
- French-English translation (Moses model):
  * Select 3M random sentences from the train dataset for training and 2k different random sentences from the same train set for tuning
  * Data cleaned with MosesCleaner, duplicates removed
  * Three 4-gram language models trained using KenLM on (i) Europarl, (ii) Ted-talks (when available), (iii) the train set of OpenSubtitles2016
  * Symmetrised alignments, tuned with Kbmira
- Tokenise all translations and name the output testset.translated.melttok.{cs,de}-en:
MElt -l en -t -x -no_s -M -K
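The naming convention above can be applied to all three sets in a loop. This sketch assumes that MElt reads standard input and writes standard output; the function name and the TOKENISE_CMD override are hypothetical.

```shell
# Hypothetical helper: tokenise every translated set in one directory,
# writing testset.translated.melttok.LANG-en next to each input file.
# Assumption: MElt reads stdin and writes stdout.
# TOKENISE_CMD may be set to substitute another command.
tokenise_translations() {
    dir=$1   # e.g. translations/langpair
    shift    # remaining arguments: source languages, e.g. cs de
    for lang in "$@"; do
        for testset in trainsmall devsmall testsmall; do
            src="$dir/$testset.translated.$lang-en"
            dst="$dir/$testset.translated.melttok.$lang-en"
            if [ ! -f "$src" ]; then
                echo "skipping missing file: $src" >&2
                continue
            fi
            if [ -n "${TOKENISE_CMD:-}" ]; then
                $TOKENISE_CMD < "$src" > "$dst"
            else
                MElt -l en -t -x -no_s -M -K < "$src" > "$dst"
            fi
        done
    done
}
# Example call (the directory is a placeholder):
#   tokenise_translations translations/cs-en cs
```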
cd classify/lang_pair
Models are stored in model-seq/ and model-one/