Toolkit for training/converting LibreTranslate compatible language models 🚂

Locomotive

Easy to use, cross-platform toolkit to train argos-translate models, which can be used by LibreTranslate 🚂

It can also convert pre-trained Opus-MT models.

Requirements

  • Python >= 3.8
  • NVIDIA CUDA graphics card (not required, but highly recommended)

Install

git clone https://github.com/LibreTranslate/Locomotive --depth 1
cd Locomotive
pip install -r requirements.txt

Background

Language models can be trained by providing lots of example translations from a source language to a target language. All you need to get started is a set of two files: a source file containing sentences written in the source language, and a corresponding target file containing the same sentences, line by line, written in the target language.

For example:

source.txt:

Hello
I'm a train!
Goodbye

target.txt:

Hola
¡Soy un tren!
Adiós

You'll need a few million sentences to train decent models, and at least ~100k sentences to get some results. OPUS has a good collection of datasets to get started. You can also use any of the data sources listed on the argos-train index. Also check NLLU.
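
Since the two files must correspond line by line, a quick check before training is to compare their line counts. The following is a minimal sketch and not part of Locomotive: the helper names count_lines and check_parallel are hypothetical, and the default threshold simply mirrors the rough ~100k minimum mentioned above.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file without loading it all into memory."""
    with Path(path).open("r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(source_path, target_path, min_sentences=100_000):
    """Return (ok, message) describing whether the two files look usable.

    Hypothetical helper: it only compares line counts. It does not verify
    that the sentences are actual translations of each other.
    """
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        return False, f"line count mismatch: {n_source} source vs {n_target} target"
    if n_source < min_sentences:
        return True, f"aligned, but only {n_source} sentence pairs"
    return True, f"aligned, {n_source} sentence pairs"
```

A mismatch usually means a stray blank line or a truncated file, which is worth fixing before spending GPU time on training.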

Usage

Place source.txt and target.txt files in a folder (e.g. mydataset-en_es) of your choice:

mydataset-en_es/
├── source.txt
└── target.txt

Create a config.json file specifying your sources:

{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Spanish",
        "code": "es"
    },
    "version": "1.0",
    "sources": [
        "file://D:\\path\\to\\mydataset-en_es",
        "opus://Ubuntu",
        "http://data.argosopentech.com/data-ccaligned-en_es.argosdata"
    ]   
}

Note that you can specify local folders (using the file:// prefix), internet URLs to .zip archives (using the http:// or https:// prefix) or OPUS datasets (using the opus:// prefix). For a complete list of OPUS datasets, see OPUS.md. Dataset names are case-sensitive.
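
If you prefer to generate the config from a script, something along these lines could work. This is a sketch under assumptions: build_config is a hypothetical helper, it only emits the keys shown in the example above, and it only handles sources given as plain strings. The prefix check reflects the prefixes listed above, not validation performed by Locomotive itself.

```python
import json

# Source prefixes described in this README.
KNOWN_PREFIXES = ("file://", "http://", "https://", "opus://")


def build_config(from_name, from_code, to_name, to_code, sources, version="1.0"):
    """Build a config dictionary using only the keys shown in the example."""
    for source in sources:
        if not source.startswith(KNOWN_PREFIXES):
            raise ValueError(f"unsupported source prefix: {source}")
    return {
        "from": {"name": from_name, "code": from_code},
        "to": {"name": to_name, "code": to_code},
        "version": version,
        "sources": list(sources),
    }


if __name__ == "__main__":
    config = build_config(
        "English", "en", "Spanish", "es",
        # A raw string keeps the Windows backslashes literal; json.dumps
        # escapes them as \\ in the output.
        [r"file://D:\path\to\mydataset-en_es", "opus://Ubuntu"],
    )
    print(json.dumps(config, indent=4, ensure_ascii=False))
```

Redirect the printed output to config.json, or write it with json.dump, before running the training command.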

Then run:

python train.py --config config.json

Training can take a while and, depending on the size of the datasets, can require a graphics card with lots of memory.

The output will be saved in run/[model]/translate-[from]_[to]-[version].argosmodel.

Running out of memory

If you're running out of CUDA memory, decrease the batch_size parameter, which by default is set to 8192:

{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Spanish",
        "code": "es"
    },
    "version": "1.0",
    "sources": [
        "file://D:\\path\\to\\mydataset-en_es",
        "http://data.argosopentech.com/data-ccaligned-en_es.argosdata"
    ],
    "batch_size": 2048
}

Reverse Training

Once you have trained a model from source => target, you can easily train a reverse target => source model by passing --reverse:

python train.py --config config.json --reverse
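
To run both directions back to back, the two commands can be wrapped in a small script. This is only a sketch: training_commands and train_both are hypothetical names, and it assumes the script is started from the Locomotive checkout where train.py lives.

```python
import subprocess
import sys


def training_commands(config_path):
    """Return the argument lists for the forward run, then the reverse run."""
    forward = [sys.executable, "train.py", "--config", config_path]
    reverse = forward + ["--reverse"]
    return [forward, reverse]


def train_both(config_path):
    """Run forward training first, then reverse; stop if either run fails."""
    for command in training_commands(config_path):
        subprocess.run(command, check=True)
```

Calling train_both("config.json") is equivalent to typing the two commands above one after the other.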

TensorBoard

TensorBoard allows tracking and visualizing metrics such as loss and accuracy, visualizing the model graph and other features. You can enable TensorBoard with the --tensorboard option:

python train.py --config config.json --tensorboard

Tuning

The model is generated using sensible default values. You can override the default configuration by adding values directly to your config.json. For example, to use a smaller dictionary size, add a vocab_size key in config.json:

{
    "from": {
        "name": "English",
        "code": "en"
    },
    "to": {
        "name": "Spanish",
        "code": "es"
    },
    "version": "1.0",
    "sources": [
        "file://D:\\path\\to\\mydataset-en_es",
        "http://data.argosopentech.com/data-ccaligned-en_es.argosdata"
    ],
    "vocab_size": 30000
}
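
When experimenting with several values, it can be handy to derive variants of a config file from a script instead of editing it by hand. The sketch below is one possible approach: apply_overrides is a hypothetical helper, and the keys passed to it are written to the file unchanged without any validation.

```python
import json
from pathlib import Path


def apply_overrides(config_path, output_path, **overrides):
    """Copy a config file, adding or replacing top-level keys.

    Hypothetical helper: keys such as vocab_size or batch_size are simply
    merged into the top level of the JSON document.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    config.update(overrides)
    Path(output_path).write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

For example, apply_overrides("config.json", "config-small.json", vocab_size=30000, batch_size=2048) writes a second config that can be passed to train.py with --config.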

Using Filters and Transforms

Locomotive provides various filters, transforms and augmenters which can be used to dynamically clean up, modify and augment the input sources before training:

{
    "filters": [
        "duplicates", 
        {"source_target_ratio": {"min": 0.6, "max": 1.5}}
    ],
    "transforms":[
        "remove_unpaired_quotes_and_brackets"
    ],
    "augmenters":[
        "single_word_punctuation"
    ],
    "sources": [
        {
            "source": "file://D:\\path\\to\\mydataset-en_es", 
            "filters": [
                {"char_length": {"min": 20}}
            ]
        }
    ]
}

Filters, transforms and augmenters can be specified globally (applied to all sources) as well as per-source (applied only to the specified source).
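
To give a feel for what such filters do, here is a rough sketch of a duplicate filter, a length-ratio filter and a minimum-length filter. It is an illustration only and not Locomotive's implementation: the function names are hypothetical, and the exact definitions (for example, that the ratio is computed on character counts, or that the length check applies to both sides) are assumptions.

```python
def keep_by_ratio(source, target, min_ratio=0.6, max_ratio=1.5):
    """Keep a pair if len(source) / len(target) lies within [min_ratio, max_ratio]."""
    if not target:
        return False  # avoid dividing by zero; an empty target is unusable anyway
    ratio = len(source) / len(target)
    return min_ratio <= ratio <= max_ratio


def keep_by_char_length(source, target, min_chars=20):
    """Keep a pair only if both sentences have at least min_chars characters."""
    return len(source) >= min_chars and len(target) >= min_chars


def drop_duplicates(pairs):
    """Yield each (source, target) pair once, preserving first-seen order."""
    seen = set()
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            yield pair
```

A pair whose source and target differ wildly in length is often a misalignment rather than a translation, which is the kind of noise a ratio filter is meant to catch.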

Using Weights

You can specify a weight for each source, for example to instruct the training to use fewer samples from certain datasets:

{
    "sources": [
        {"source": "file://D:\\path\\to\\mydataset-en_es", "weight": 1},
        {"source": "http://data.argosopentech.com/data-ccaligned-en_es.argosdata", "weight": 5}
    ]
}

In the example above, 1 sample will be taken from mydataset and 5 will be taken from CCAligned.

Specifying weights disables filtering, transformations and augmentations. The datasets are used as-is. No merging or shuffling is performed either. A weight of 1 can be used to instruct Locomotive to not preprocess a source.
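
One way to picture the ratio is as a round-robin that takes a number of samples equal to each weight from each source in turn. The sketch below only illustrates that ratio; weighted_interleave is a hypothetical function, and the actual sampling inside the training pipeline may work differently.

```python
from itertools import cycle, islice


def weighted_interleave(sources, total):
    """Return `total` samples, taking `weight` items from each source per round.

    `sources` is a list of (samples, weight) tuples. Each sample list is
    cycled, so a small dataset is repeated rather than exhausted.
    """
    for samples, weight in sources:
        if not samples or weight < 1:
            raise ValueError("each source needs samples and a weight >= 1")
    iterators = [(cycle(samples), weight) for samples, weight in sources]

    def rounds():
        while True:
            for iterator, weight in iterators:
                # islice consumes `weight` items and leaves the rest for later rounds
                yield from islice(iterator, weight)

    return list(islice(rounds(), total))
```

With weights 1 and 5, every group of six samples contains one from the first source and five from the second.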

Evaluate

You can evaluate the model by running:

python eval.py --config config.json
Starting interactive mode
(en)> Hello!
(es)> ¡Hola!
(en)>

You can also compute BLEU scores against the flores200 dataset for the model by running:

python eval.py --config config.json --bleu
BLEU score: 45.12354

Convert Helsinki-NLP OPUS MT models

Locomotive provides a convenient script to convert pre-trained models from OPUS-MT to make them compatible with LibreTranslate:

python opus_mt_convert.py -s en -t it

This will attempt to automatically find/download the OPUS-MT model archive from https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/ or https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/. This doesn't always work, and will not always pick the best model. You can specify a model archive manually by using the --model-url parameter.

Some models also need a beginning of sentence (BOS) token for the model to work. You can specify a BOS token by using the --bos parameter. For example, using both parameters:

python opus_mt_convert.py -s en -t vi --model-url https://object.pouta.csc.fi/Tatoeba-MT-models/eng-vie/opus+bt-2021-04-10.zip --bos ">>vie<<"

To run evaluation:

python eval.py --config run/en_it-opus_1.0/config.json

The script is experimental. If you find issues, feel free to open a pull request!

Known Limitations

Some models fail to execute with int8 quantization. If you get a lot of repeated words, try setting -q float32 to keep full precision.

Contribute

Want to share your model with the world? Post it on community.libretranslate.com and we'll include it in future releases of LibreTranslate. Make sure to share both a forward and reverse model (e.g. en => es and es => en), otherwise we won't be able to include it in the model repository.

We also welcome contributions to Locomotive! Just open a pull request.

Use with LibreTranslate

To install the resulting .argosmodel file, locate the ~/.local/share/argos-translate/packages folder. On Windows this is the %userprofile%\.local\share\argos-translate\packages folder. Then create a [from-code]_[to-code] folder (e.g. en_es). If it already exists, delete or move it.

Extract the contents of the .argosmodel file into this folder. The file is just a .zip archive, so you might need to change the extension to .zip first. Then restart LibreTranslate.
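
The manual steps can also be scripted with Python's zipfile module. This is a minimal sketch under assumptions: extract_argosmodel is a hypothetical helper, the default packages path is the Linux/macOS location given above, and the internal layout of the archive is extracted as-is without being inspected.

```python
import zipfile
from pathlib import Path


def extract_argosmodel(model_path, from_code, to_code, packages_dir=None):
    """Extract a .argosmodel archive into packages_dir/[from-code]_[to-code]."""
    if packages_dir is None:
        packages_dir = Path.home() / ".local" / "share" / "argos-translate" / "packages"
    target = Path(packages_dir) / f"{from_code}_{to_code}"
    if target.exists():
        # Mirror the manual instructions: never overwrite an existing package
        raise FileExistsError(f"{target} already exists; delete or move it first")
    target.mkdir(parents=True)
    # zipfile reads the archive regardless of its file extension
    with zipfile.ZipFile(model_path) as archive:
        archive.extractall(target)
    return target
```

Because zipfile does not care about the extension, there is no need to rename the file when extracting this way. Restart LibreTranslate afterwards so it picks up the new package.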

You can also install .argosmodel packages from Python:

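import pathlib
import argostranslate.package

# Path to the .argosmodel file produced by training or conversion
package_path = pathlib.Path("/root/translate-en_it-2_0.argosmodel")
# Install the package into the local argos-translate packages directory
argostranslate.package.install_from_path(package_path)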

Credits

In no particular order, we'd like to thank:

For making Locomotive possible.

License

AGPLv3