PortageTrainingFramework: A Makefile repository from National Research Council of Canada — Conseil national de recherches du Canada - National Research Council of Canada

Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2008-2022, Sa Majeste la Reine du Chef du Canada
Copyright 2008-2022, Her Majesty in Right of Canada

MIT License - see LICENSE

Structure d'entraînement pour Portage-SMT-TAS

La structure d'entraînement pour Portage-SMT-TAS ici présente est conçue pour
remplir une double fonction: elle se veut le point de départ pour les travaux
expérimentaux avec Portage-SMT-TAS, ainsi que le point de départ, une fois
configurée correctement, pour l'entraînement de systèmes de traduction
automatique de production.

Portage-SMT-TAS training framework

Framework authors: Samuel Larkin, Darlene Stewart and Eric Joanis, with
suggestions by George Foster

This experimental framework is intended as a starting point for experiments
with PortageII. It uses recommended settings for a reasonable experimental
starting point. In practice, our settings change all the time, at NRC, as we
continue our research on Statistical Machine Translation, so no framework can
truly represent the current state of the art. Nonetheless, we've tried to
present in this framework our currently recommended practice.

This is also the basis framework used to train production SMT systems with
PortageII, when properly configured.

A detailed tutorial is now available to accompany this framework. Please see
tutorial.pdf. (Run "make doc" if you only see tutorial.tex.)

QUICK START GUIDE

The following describes the quickest way to get started with this framework.

SOFTWARE SETUP:

Make sure you have PortageII installed and available in your $PATH. Also make
sure you have IRSTLM properly compiled and set up, which means that you have
set $IRSTLM to the location of your installed IRSTLM. See the INSTALL file in
PortageII for more details and more dependencies. Typing "make help" will give
you sample commands you can cut and paste to do so. Alternatively, if you have
SRILM installed, set LM_TOOLKIT=SRI in Makefile.params or LM_TOOLKIT=MIT to use
MITLM's toolkit.

TRAINING:

Start each system you want to train from a fresh copy of the framework:
cp -pr $PORTAGE/framework <new-framework-instance>
or
git clone https://github.com/nrc-cnrc/PortageTrainingFramework.git <new-framework-instance>

Then you will need to copy or symlink your training files into corpora. Your
files should have the pattern <PREFIX>_<LANG>.raw, where <LANG> is two letters
representing the language for that file. These files should contain one
sentence per line, in original truecase, tokenized and sentence aligned with
the matching lines in the other language, except for the file used for the
language model. This framework requires two pairs of tuning files containing
around 2000 sentences each, and two pairs of test files. It also requires
training copora for the language model and the translation models, which are
kept compressed (.gz) since they are expected to be large. Here's what your
corpora directory should look like:

corpora/dev1_en.raw
corpora/dev1_fr.raw
corpora/dev2_en.raw
corpora/dev2_fr.raw
corpora/lm-train_fr.raw.gz
corpora/Makefile <= provided by the framework.
corpora/Makefile.params <= provided by the framework.
corpora/test1_en.raw
corpora/test1_fr.raw
corpora/test2_en.raw
corpora/test2_fr.raw
corpora/tm-train_en.raw.gz
corpora/tm-train_fr.raw.gz

As an example, dev1_en.raw is an english file with dev1 as its <PREFIX>.

If you decide not to use the default file names for any of the previous files,
you must edit Makefile.params accordingly. Replace the prefixes of PRIMARY_LM
(or TRAIN_LM or MIXLM), TRAIN_TM and the other TRAIN_* variables, as well as
the various TUNE_* variables, and finally TEST_SET, to reflect your file names.
Note that these variables are not the full file name but rather the <PREFIX> of
each file or file pair.

If you are building a system to translate languages other than English->French,
you will need to modify SRC_LANG and TGT_LANG in Makefile.params. You will
also notice that that <LANG> = {SRC_LANG, TGT_LANG} and are preferably two
letter tokens representing your source and target languages.

At this point "make all" should do all the work, from training all models to
getting BLEU scores, tuning decoding weights, and optionally rescoring weights
and a confidence model, in the process.

ONCE YOU HAVE A TRAINED SYSTEM:

Once your system is trained, you might want to translate some other documents.
These documents don't need an aligned counterpart but must still be tokenized
and have one sentence per line. To translate a new document, simply run:
./translate.sh <YOUR_NEW_TOKENIZED_DOCUMENT>

TO DEPLOY TO PORTAGELIVE:

Once your system is trained, you can also install it on a PortageLive server
for regular use. Start with
make portageLive
and then follow the instructions in your PortageII distro's PortageLive
directory.

TROUBLE SHOOTING GUIDE:
This part tries to solve simple mistakes that prevent you from running
successfully this framework. It covers some issues we've seen in the past, but
it is not intended to be complete.

It is always good to check the commands you are about to run by typing:
> make all -n
where -n tells make not to run the commands.

Also, we've tried to create log.* files, these should indicate what was the
problem so you always read them first.

1. When running the corpora module:
Error message:
make: *** No rule to make target `lm-train1_fr.lc.gz', needed by `lc'. Stop.
This is the most common mistake and usally means there is a prefix mismatch
between the filename and the prefix for either TRAIN_LM, TRAIN_TM, TUNE_DECODE,
TUNE_RESCORE or TEST_SET in ./Makefile.params. The solution is simply to match
the prefix with the filename in the following manner:
filename: <prefix>_<language>.raw i.e. test_en.raw
prefix: <prefix> i.e. test
If you have the proper prefix but it still fails, make sure that your file's
extension is .raw for TUNE_* & TEST_SET and .raw.gz for TRAIN_*.

2. When running the models/lm module:
Make sure you have successfully run "make -C corpora all" or refer to 1.

If you are using IRSTLM:
Error message:
make: *** No rule to make target `lm-train_fr-kn-5g.binlm.gz', needed by `binlm'. Stop.
Then try:
> make -C models/lm mark
and if you get the following error:
make: *** No rule to make target `lm-train_fr.marked.gz', needed by `mark'. Stop.
most likely, you didn't successfully run "make -C corpora".

Error message:
/bin/sh: line 3: add-start-end.sh: command not found
make: *** [lm-train_fr.marked.gz] Error 127
This means that add-start-end.sh is not in your path. Make sure that you have
properly installed IRSTLM and that it is present in your path.
> which add-start-end.sh

If you are using SRILM:
Error message:
make: *** [lm-train_fr-kn-5g.lm.gz] Error 127
make sure ngram-count is in your path by typing:
> which ngram-count
Make sure you have correctly installed SRILM.

Either SRILM or IRSTLM:
Error message:
make: *** [lm-train_fr-kn-5g.binlm.gz] Error 127
which mostly likely indicates that your PortageII setup is incomplete.
You can verify that you have access to PortageII programs by typing:
> which arpalm2binlm
Make sure you have correctly installed PortageII.

3. When running the models/tc module:
Make sure you have successfully run "make -C corpora all" or refer to 1.

Make sure that you were successful in running the models/lm module since the tc
module relies heavily on it.

4. When running the models/tm module:
First, make sure the corpora module was successfully completed or refer to 1.

5. When running the models/decode module:
First, make sure the models/lm and models/tm modules were successfully
completed or refer to 2 or 4.

6. When running the models/rescore module:
First, make sure the models/decode module was successfully completed.

7. When running the translate module:
First, make sure the models/decode and possibly models/rescore modules were
successfully completed.

8. General commands
If nothing works, you can always try the following enormously verbose "hardcore
debugger's" command in the faulty module:
> make -n -prd | less
First, this will not run the command because of -n and secondly, this will
output all make knows.