
                              Portage-SMT-TAS
 
Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2021, Sa Majesté la Reine du Chef du Canada
Copyright 2004-2021, Her Majesty in Right of Canada

MIT License - see LICENSE

See NOTICE for the Copyright notices of 3rd party libraries.

Français: LISEZMOI


                                Description

Portage is a project to explore new techniques in Statistical Machine
Translation (SMT), and to develop state-of-the-art SMT technology that can be
used commercially, as well as to support other projects at NRC and elsewhere.

Various SMT systems were developed under the Portage project.  PORTAGEshared
was first released to Canadian universities to build and train a community of
users.  Commercial releases of PORTAGEshared 1.1 to 1.3 were made, followed by
Portage 1.4.

PortageII marked a leap forward in NRC's SMT technology.  With version 1.0, we
brought significant improvements to the translation engine itself that result
in better translations, and that contributed to our success at the NIST Open
Machine Translation 2012 Evaluation (OpenMT12).  With version 2.0, we added two
important features to help translators: the transfer of markup tags from the
source sentence to the machine translation output, and handling of the
increasingly common XLIFF file format.  Version 2.1 was a maintenance release,
while version 2.2 brought in the fixed-terms module.

PortageII 3.0 marked another important step in the core SMT technology,
bringing in deep learning (with Neural Network Joint Models, or NNJMs), an
improved reordering model based on sparse features, coarse LMs and other
improvements.  The NRC put significant effort into optimizing the default
training parameters to improve the quality of the translations produced by
PortageII 3.0 out of the box, and we expect users to see and appreciate the
difference.

PortageII 4.0 brought in incremental document adaptation and training NNJMs on
your own data, with fully updated generic models.

Portage-SMT-TAS is the name we chose for the GitHub repo, now that we are
releasing the code base as open source under the MIT License.  Despite the rise
of NMT, the Portage code base still includes a number of tools that remain
useful to the world of Machine Translation and parallel text processing.  All
the names above may appear in various parts of the documentation and code; they
all refer to the same system or one of its versions.  When you see
PortageII_cur, that is a placeholder that our script for creating an official
release replaces with the actual version name and number.

Note that some of the Portage tools have been split out into separate GitHub repos:
 - https://github.com/nrc-cnrc/PortageTextProcessing contains text processing
   tools that are relevant for both NMT and SMT work.
 - https://github.com/nrc-cnrc/PortageClusterUtils contains the multi-core and
   high-performance cluster parallelization tools that were developed for
   Portage, but can be used for other projects as well.

Portage-SMT-TAS includes a complete suite of software tools required for
phrase-based statistical machine translation, including:
 - preprocessing of French, English, Spanish, Danish (tokenization,
   detokenization, lowercasing, aligning), Chinese (handling of dates and
   numbers), and Arabic (integration with the MADA tokenizer);
 - training lexical translation models (IBM models 1 and 2, HMM, plus
   integration with MGiza++ for IBM4) from aligned corpora;
 - training phrase-based translation models from aligned corpora and lexical
   models;
 - training lexicalized distortion models, and their hierarchical variant;
 - training sparse features, including discriminative hierarchical distortion
   models;
 - using language models in text and binary format for various purposes;
 - training, fine-tuning and using Neural Network Joint Models (NNJMs) to
   improve translation quality;
 - filtering language and translation models to retain only information needed
   for a particular text to be translated;
 - adapting language and translation models to reflect the characteristics of
   the data currently being translated;
 - decoding (doing the actual translation, producing n-best lists or lattices);
 - rescoring (choosing the best hypothesis from an n-best list) using sources
   of information that are external to the decoder; a collection of rescoring
   features;
 - optimizing weights, both for decoding and for rescoring, using a variety of
   tuning algorithms, including N-Best MERT, Lattice and N-Best MIRA and a few
   more;
 - truecasing (restoring upper case letters in the output of the translation);
 - evaluation using BLEU, WER or PER (see the BLEU sketch after this list);
 - a tightly packed representation of various model files (LMs, TMs, (H)LDMs,
   suffix arrays) for instant loading and fast access via memory-mapped I/O;
 - a confidence estimation module to compute a per-sentence quality
   prediction/estimation of the translations produced by Portage;
 - an API to incorporate the decoder into other applications using SOAP or CGI;
   a web-ready demo;
 - PortageLive -- tools to set up a runtime SMT server using Portage,
   accessible via a SOAP API, CGI scripts or a command line interface;
 - incremental document model adaptation;
 - code to handle translation projects in the TMX and XLIFF formats, with
   optional transfer of formatting tags from the source text to the output;
 - an experimental framework -- a collection of tools to automate running all
   the components required to train a full machine translation system (see
   https://github.com/nrc-cnrc/PortageTrainingFramework);
 - generic en->fr and fr->en models to supplement in-domain training corpora,
   for improved translation quality, especially for smaller corpora;
 - lots of general utility code needed for the above.
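
As a point of reference for the evaluation step above, here is a minimal
corpus-level BLEU sketch in Python.  It illustrates the metric only, under the
simplifying assumption of one reference per sentence; it is not Portage's own
implementation, which supports multiple references and more.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Count the n-grams of a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def corpus_bleu(hyps, refs, max_n=4):
        """BLEU over pre-tokenized sentences, one reference per hypothesis."""
        clipped = [0] * max_n   # clipped n-gram matches, by order
        totals = [0] * max_n    # hypothesis n-gram counts, by order
        hyp_len = ref_len = 0
        for hyp, ref in zip(hyps, refs):
            hyp_len += len(hyp)
            ref_len += len(ref)
            for n in range(1, max_n + 1):
                hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
                totals[n - 1] += sum(hyp_ng.values())
                # Clip each hypothesis n-gram count at its reference count.
                clipped[n - 1] += sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
        if hyp_len == 0 or min(clipped) == 0:
            return 0.0  # unsmoothed BLEU is 0 when any order has no match
        # Geometric mean of the modified n-gram precisions, scaled by the
        # brevity penalty when the hypotheses are shorter than the references.
        log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
        bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
        return bp * math.exp(log_prec)

    # Example: one sentence pair differing in a single word (BLEU ~0.54).
    print(corpus_bleu([["the", "cat", "sat", "on", "the", "mat"]],
                      [["the", "cat", "sat", "on", "a", "mat"]]))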


                         Contents of this directory

   bin/         executable code (once compiled)
   doc/         documentation (once generated)
   framework/   framework for training PortageII and doing experiments [TODO - this will be a separate GitHub repo]
   generic-model/ location to install PortageGenericModel 2.0 [TODO - where will this be?]
   lib/         run-time libraries (once compiled)
   logs/        place-holder for runtime logs
   PortageLive/ files needed to setup a runtime translation server
   src/         source code
   test-suite/  various test suites
   third-party/ non-NRC software distributed with PortageII for convenience
   tmx-prepro/  tools to extract and preprocess training corpora from TMX files
                [TODO - this will be a separate GitHub repo]
   INSTALL      installation instructions
   NOTICE       third party Copyright notices
   README       this file
   RELEASES     release history with change notes
   SETUP.bash   config file to activate Portage-SMT-TAS


                              Getting started

The framework subdirectory [TODO - fix] provides a PortageII training and experiment
framework.  We recommend you use this framework as your starting point for real
experiments, and as your default configuration for training PortageII systems
for use with PortageLive.  A tutorial is included showing how to run a toy
experiment with this framework: tutorial.pdf.  Even if you have used PortageII
2.2 or earlier, we recommend you read through this document, as it
highlights the currently recommended procedures for using PortageII.  Please do
not use our old toy example as a starting point: start from the framework
instead.

For further documentation, look at the contents of doc/.


                     Upgrading from a previous version

If you have models that were created using a version of PortageII older than
3.0, we strongly recommend that you retrain new models with the current
version, to take advantage of the new features.

Between PortageII 2.x and PortageII 3.0, we conducted extensive experiments to
optimize the training procedures and updated the framework defaults to reflect
the configuration we now recommend.  If you have customized
or created templates from the framework's Makefile.params file, please update
them to reflect the changes in PortageII 3.0.  If you use custom plugins,
please review the new default plugins, which might meet your needs out of the
box.

We now recommend (and enable by default) the following settings:
 - Use the new sparse features.
 - Use the new Coarse LM models.
 - Use IBM4 word alignments, along with IBM2 and HMM ones (requires MGiza++).
 - If the target language is French or English, use the generic LM as a
   background model in a MixLM with your main language model (note: this will
   not introduce out-of-domain terminology into your translations; rather, it
   will help choose among competing alternatives from your own in-domain data).

For the language model, the framework now has a PRIMARY_LM variable that you
can use instead of TRAIN_LM or MIXLM: when you use it, the framework will
automatically use the generic LM if you're translating into French or English,
and it will automatically use MixLMs instead of regular ones as soon as there
is more than one component LM or if the generic LM applies.
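
As a rough illustration of that selection rule, the Python sketch below
restates it in code.  This is a hypothetical re-statement for exposition, not
framework code, and the component-LM names are placeholders.

    # Hypothetical restatement of the framework's PRIMARY_LM selection rule;
    # for exposition only, this is not actual framework code.
    def choose_lm_setup(primary_lms, target_lang):
        """primary_lms: component LMs declared via PRIMARY_LM (placeholders)."""
        components = list(primary_lms)
        if target_lang in ("fr", "en"):       # a generic LM exists for these
            components.append("generic-lm")   # placeholder name
        # Use a MixLM as soon as there is more than one component LM:
        kind = "mixlm" if len(components) > 1 else "lm"
        return kind, components

    print(choose_lm_setup(["mycorpus"], "fr"))  # ('mixlm', ['mycorpus', 'generic-lm'])
    print(choose_lm_setup(["mycorpus"], "de"))  # ('lm', ['mycorpus'])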

A matrix showing the evolution of the SOAP API, plugins and training features
is now provided: review doc/PortageAPIComparison.pdf when upgrading.

The PortageLive SOAP API has changed significantly with version 3.0 as well.
Review the updated PortageLiveAPI.wsdl, doc/PortageAPIComparison.pdf and the
php scripts in PortageLive/www/soap when updating your applications to work
with PortageII 3.0.  When installing the updated API, remove the compiled WSDL
files cached by Apache in /tmp/wsdl-*.
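
For client code, a minimal sketch of calling a PortageLive SOAP endpoint with
the third-party Python zeep library is shown below.  The URL, operation name
and arguments are placeholders: take the real signatures from
PortageLiveAPI.wsdl and the PHP samples in PortageLive/www/soap.

    # Hypothetical PortageLive SOAP client sketch using zeep (pip install zeep).
    # The URL, operation name and arguments below are placeholders; consult
    # PortageLiveAPI.wsdl for the actual service definition.
    from zeep import Client

    client = Client("http://localhost/PortageLiveAPI.wsdl")   # placeholder URL
    # zeep exposes each WSDL operation under client.service:
    result = client.service.translate("Hello, world!")        # assumed operation
    print(result)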