Traitement multilingue de textes / Multilingual Text Processing Centre de recherche en technologies numériques / Digital Technologies Research Centre Conseil national de recherches Canada / National Research Council Canada Copyright 2008-2022, Sa Majeste la Reine du Chef du Canada Copyright 2008-2022, Her Majesty in Right of Canada MIT License - see LICENSE Structure d'entraînement pour Portage-SMT-TAS La structure d'entraînement pour Portage-SMT-TAS ici présente est conçue pour remplir une double fonction: elle se veut le point de départ pour les travaux expérimentaux avec Portage-SMT-TAS, ainsi que le point de départ, une fois configurée correctement, pour l'entraînement de systèmes de traduction automatique de production. Portage-SMT-TAS training framework Framework authors: Samuel Larkin, Darlene Stewart and Eric Joanis, with suggestions by George Foster This experimental framework is intended as a starting point for experiments with PortageII. It uses recommended settings for a reasonable experimental starting point. In practice, our settings change all the time, at NRC, as we continue our research on Statistical Machine Translation, so no framework can truly represent the current state of the art. Nonetheless, we've tried to present in this framework our currently recommended practice. This is also the basis framework used to train production SMT systems with PortageII, when properly configured. A detailed tutorial is now available to accompany this framework. Please see tutorial.pdf. (Run "make doc" if you only see tutorial.tex.) QUICK START GUIDE The following describes the quickest way to get started with this framework. SOFTWARE SETUP: Make sure you have PortageII installed and available in your $PATH. Also make sure you have IRSTLM properly compiled and set up, which means that you have set $IRSTLM to the location of your installed IRSTLM. See the INSTALL file in PortageII for more details and more dependencies. Typing "make help" will give you sample commands you can cut and paste to do so. Alternatively, if you have SRILM installed, set LM_TOOLKIT=SRI in Makefile.params or LM_TOOLKIT=MIT to use MITLM's toolkit. TRAINING: Start each system you want to train from a fresh copy of the framework: cp -pr $PORTAGE/framework <new-framework-instance> or git clone https://github.com/nrc-cnrc/PortageTrainingFramework.git <new-framework-instance> Then you will need to copy or symlink your training files into corpora. Your files should have the pattern <PREFIX>_<LANG>.raw, where <LANG> is two letters representing the language for that file. These files should contain one sentence per line, in original truecase, tokenized and sentence aligned with the matching lines in the other language, except for the file used for the language model. This framework requires two pairs of tuning files containing around 2000 sentences each, and two pairs of test files. It also requires training copora for the language model and the translation models, which are kept compressed (.gz) since they are expected to be large. Here's what your corpora directory should look like: corpora/dev1_en.raw corpora/dev1_fr.raw corpora/dev2_en.raw corpora/dev2_fr.raw corpora/lm-train_fr.raw.gz corpora/Makefile <= provided by the framework. corpora/Makefile.params <= provided by the framework. corpora/test1_en.raw corpora/test1_fr.raw corpora/test2_en.raw corpora/test2_fr.raw corpora/tm-train_en.raw.gz corpora/tm-train_fr.raw.gz As an example, dev1_en.raw is an english file with dev1 as its <PREFIX>. If you decide not to use the default file names for any of the previous files, you must edit Makefile.params accordingly. Replace the prefixes of PRIMARY_LM (or TRAIN_LM or MIXLM), TRAIN_TM and the other TRAIN_* variables, as well as the various TUNE_* variables, and finally TEST_SET, to reflect your file names. Note that these variables are not the full file name but rather the <PREFIX> of each file or file pair. If you are building a system to translate languages other than English->French, you will need to modify SRC_LANG and TGT_LANG in Makefile.params. You will also notice that that <LANG> = {SRC_LANG, TGT_LANG} and are preferably two letter tokens representing your source and target languages. At this point "make all" should do all the work, from training all models to getting BLEU scores, tuning decoding weights, and optionally rescoring weights and a confidence model, in the process. ONCE YOU HAVE A TRAINED SYSTEM: Once your system is trained, you might want to translate some other documents. These documents don't need an aligned counterpart but must still be tokenized and have one sentence per line. To translate a new document, simply run: ./translate.sh <YOUR_NEW_TOKENIZED_DOCUMENT> TO DEPLOY TO PORTAGELIVE: Once your system is trained, you can also install it on a PortageLive server for regular use. Start with make portageLive and then follow the instructions in your PortageII distro's PortageLive directory. TROUBLE SHOOTING GUIDE: This part tries to solve simple mistakes that prevent you from running successfully this framework. It covers some issues we've seen in the past, but it is not intended to be complete. It is always good to check the commands you are about to run by typing: > make all -n where -n tells make not to run the commands. Also, we've tried to create log.* files, these should indicate what was the problem so you always read them first. 1. When running the corpora module: Error message: make: *** No rule to make target `lm-train1_fr.lc.gz', needed by `lc'. Stop. This is the most common mistake and usally means there is a prefix mismatch between the filename and the prefix for either TRAIN_LM, TRAIN_TM, TUNE_DECODE, TUNE_RESCORE or TEST_SET in ./Makefile.params. The solution is simply to match the prefix with the filename in the following manner: filename: <prefix>_<language>.raw i.e. test_en.raw prefix: <prefix> i.e. test If you have the proper prefix but it still fails, make sure that your file's extension is .raw for TUNE_* & TEST_SET and .raw.gz for TRAIN_*. 2. When running the models/lm module: Make sure you have successfully run "make -C corpora all" or refer to 1. If you are using IRSTLM: Error message: make: *** No rule to make target `lm-train_fr-kn-5g.binlm.gz', needed by `binlm'. Stop. Then try: > make -C models/lm mark and if you get the following error: make: *** No rule to make target `lm-train_fr.marked.gz', needed by `mark'. Stop. most likely, you didn't successfully run "make -C corpora". Error message: /bin/sh: line 3: add-start-end.sh: command not found make: *** [lm-train_fr.marked.gz] Error 127 This means that add-start-end.sh is not in your path. Make sure that you have properly installed IRSTLM and that it is present in your path. > which add-start-end.sh If you are using SRILM: Error message: make: *** [lm-train_fr-kn-5g.lm.gz] Error 127 make sure ngram-count is in your path by typing: > which ngram-count Make sure you have correctly installed SRILM. Either SRILM or IRSTLM: Error message: make: *** [lm-train_fr-kn-5g.binlm.gz] Error 127 which mostly likely indicates that your PortageII setup is incomplete. You can verify that you have access to PortageII programs by typing: > which arpalm2binlm Make sure you have correctly installed PortageII. 3. When running the models/tc module: Make sure you have successfully run "make -C corpora all" or refer to 1. Make sure that you were successful in running the models/lm module since the tc module relies heavily on it. 4. When running the models/tm module: First, make sure the corpora module was successfully completed or refer to 1. 5. When running the models/decode module: First, make sure the models/lm and models/tm modules were successfully completed or refer to 2 or 4. 6. When running the models/rescore module: First, make sure the models/decode module was successfully completed. 7. When running the translate module: First, make sure the models/decode and possibly models/rescore modules were successfully completed. 8. General commands If nothing works, you can always try the following enormously verbose "hardcore debugger's" command in the faulty module: > make -n -prd | less First, this will not run the command because of -n and secondly, this will output all make knows.
nrc-cnrc/PortageTrainingFramework
Framework for training Portage Statistical Machine Translation models — Structure pour entraîner des modèles de traduction automatique statistique Portage
MakefileMIT