This repository contains the scripts used to train neural machine translation models with OpenNMT, together with the models published by Softcatalà.
For more information about training, see the TRAINING document.
The corpora used to train these models are available here: https://github.com/Softcatala/parallel-catalan-corpus/
The tools that Softcatalà uses to serve these models in production are available here: https://github.com/Softcatala/nmt-softcatala
Language pair | SC model BLEU | SC Flores101 BLEU | Google BLEU | Meta NLLB200 BLEU | Opus-MT BLEU | Sentences | Download model |
---|---|---|---|---|---|---|---|
German-Catalan | 34.8 | 28.9 | 35.5 | 30.7 | 18.5 | 3142257 | deu-cat-2022-11-14.zip |
Catalan-German | 28.5 | 25.4 | 32.9 | 29.1 | 15.8 | 3142257 | cat-deu-2022-11-16.zip |
English-Catalan | 46.8 | 43.1 | 46.0 | 41.7 | 29.8 | 4741504 | eng-cat-2022-11-09.zip |
Catalan-English | 46.6 | 43.3 | 47.0 | 48.0 | 29.6 | 4741504 | cat-eng-2022-11-12.zip |
French-Catalan | 41.3 | 31.6 | 37.3 | 33.3 | 27.2 | 2566302 | fra-cat-2022-11-09.zip |
Catalan-French | 41.4 | 35.4 | 41.7 | 39.6 | 27.9 | 2566302 | cat-fra-2022-11-14.zip |
Galician-Catalan | 74.1 | 31.4 | 36.5 | 33.2 | N/A | 2710149 | glg-cat-2022-11-17.zip |
Catalan-Galician | 80.7 | 31.9 | 33.1 | 31.7 | N/A | 2710149 | cat-glg-2022-11-21.zip |
Italian-Catalan | 39.7 | 26.5 | 30.6 | 27.8 | 22.0 | 2584598 | ita-cat-2022-11-11.zip |
Catalan-Italian | 36.2 | 24.5 | 27.5 | 26.0 | 19.2 | 2584598 | cat-ita-2022-11-15.zip |
Japanese-Catalan | 24.3 | 17.0 | 23.4 | N/A | N/A | 1974248 | jpn-cat-2022-11-18.zip |
Catalan-Japanese | 26.2 | 19.8 | 32.5 | N/A | N/A | 1974248 | cat-jpn-2022-11-19.zip |
Dutch-Catalan | 30.4 | 20.3 | 27.1 | 24.8 | 15.8 | 2208538 | nld-cat-2022-11-19.zip |
Catalan-Dutch | 27.6 | 18.2 | 23.4 | 21.8 | 13.4 | 2208538 | cat-nld-2022-11-19.zip |
Occitan-Catalan | 74.9 | 32.5 | N/A | 36.2 | N/A | 2711350 | oci-cat-2022-11-17.zip |
Catalan-Occitan | 78.8 | 28.9 | N/A | 27.8 | N/A | 2711350 | cat-oci-2022-11-21.zip |
Portuguese-Catalan | 41.6 | 33.9 | 38.7 | 34.5 | 28.1 | 2043019 | por-cat-2022-11-16.zip |
Catalan-Portuguese | 39.0 | 32.3 | 40.0 | 36.5 | 27.5 | 2043019 | cat-por-2022-11-18.zip |
Spanish-Catalan | 88.8 | 22.6 | 23.6 | 25.8 | 22.5 | 7596985 | spa-cat-2022-11-16.zip |
Catalan-Spanish | 87.5 | 24.2 | 24.2 | 25.5 | 23.2 | 7596985 | cat-spa-2022-11-17.zip |
Legend:
- SC model BLEU is the Softcatalà model's BLEU score on the test split of the training corpus (from the train/dev/test split)
- SC Flores101 BLEU is the Softcatalà model's BLEU score on the Flores101 benchmark dataset. This provides an external evaluation
- Google BLEU is the BLEU score of Google Translate on the Flores101 benchmark
- Opus-MT BLEU is the BLEU score of the Opus-MT models on the Flores101 benchmark (our ambition is to outperform them)
- Sentences is the number of sentences in the corpus used for training
- Meta NLLB200 BLEU refers to the nllb-200-3.3B model from Meta. This is a very slow model, and its distilled version performs significantly worse.
Notes:
- All models are based on TransformerRelative and use SentencePiece as the tokenizer.
- We use SacreBLEU with the 13a tokenizer to calculate BLEU scores.
- These models are used in production on modest hardware (CPU). As a result, they are a balance between precision and latency. BLEU scores could be improved further by roughly 1 point, but at a significant latency cost at inference.
- BLEU is the most popular metric for evaluating machine translation, but it is broadly acknowledged to be imperfect. It is estimated to have a ~80% correlation with human judgment.
- Flores101 has some limitations. It was produced by translating from English into the other 100 languages. When you use Flores101 to benchmark, for example, Catalan-Spanish translation, keep in mind that the Catalan and Spanish texts were each produced by translating from English. They therefore differ from what a translator would produce when translating directly between Spanish and Catalan. In summary, Flores101 is more reliable for benchmarks where English is the source or target language.
- The Occitan model is based on the Languedocian variant.
Description of the directories contained in each model zip file:
- tensorflow: model exported in TensorFlow format
- ctranslate2: model exported in CTranslate2 format (used for inference)
- metadata: description of the model
- tokenizer: SentencePiece models for both languages
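After unpacking a model, the layout described above can be sanity-checked with a few lines of Python. This is a minimal sketch using only the standard library. The helper name `missing_model_dirs` is hypothetical, and it assumes the four directories sit directly under the unpacked model directory.

```python
from pathlib import Path

# Subdirectories listed above as the contents of each model zip file.
EXPECTED_DIRS = ("tensorflow", "ctranslate2", "metadata", "tokenizer")


def missing_model_dirs(model_root):
    """Return the expected subdirectories that are absent under model_root.

    Hypothetical helper: assumes the directories sit directly under model_root.
    """
    root = Path(model_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list from `missing_model_dirs("eng-cat")` would indicate that all four directories are present, assuming the zip unpacks into a directory named `eng-cat` as in the usage examples in this README.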
You can use the models with CTranslate2 (https://github.com/OpenNMT/CTranslate2), which offers fast inference.
At Softcatalà we have also built command-line tools to translate TXT and PO files. See: https://github.com/Softcatala/nmt-softcatala/tree/master/use-models-tools
Download the model and unpack it:
wget https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eng-cat-2022-11-09.zip
unzip eng-cat-2022-11-09.zip
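The same download-and-unpack step can also be done from Python, which may be convenient on systems without a command-line unzip tool. This is a minimal sketch using only the standard library. The function name `download_and_unpack` is hypothetical, and the URL is the one given above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL of the English-Catalan model, as given above.
MODEL_URL = (
    "https://www.softcatala.org/pub/softcatala/opennmt/models/"
    "2022-11-22/eng-cat-2022-11-09.zip"
)


def download_and_unpack(url, dest_dir="."):
    """Download a zip file into dest_dir, extract it there, and return the zip path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL.
    zip_path = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return zip_path
```

Calling `download_and_unpack(MODEL_URL)` would fetch the archive into the current directory and extract it there.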
Install dependencies:
pip3 install ctranslate2 pyonmttok
Simple translation using Python:
import ctranslate2
translator = ctranslate2.Translator("eng-cat/ctranslate2/")
translator.translate_batch([["▁Hello", "▁world", "!"]])
[[{'tokens': ['▁Hola', '▁món', '!']}]]
Simple tokenization & translation using Python:
import ctranslate2
import pyonmttok
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path="eng-cat/tokenizer/sp_m.model")
tokenized = tokenizer.tokenize("Hello world!")
translator = ctranslate2.Translator("eng-cat/ctranslate2/")
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]['tokens']))
Hola món!
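To translate several sentences in one call, the tokenization and translation steps above can be wrapped in a small helper. This is a sketch under stated assumptions, not part of either library: `translate_sentences` is a hypothetical name, and the result handling assumes that newer CTranslate2 releases expose a `hypotheses` attribute on each result, while older ones use the `result[0]['tokens']` form shown above.

```python
def translate_sentences(tokenizer, translator, sentences):
    """Tokenize, translate, and detokenize a list of sentences in one batch.

    tokenizer: object with tokenize() and detokenize(), such as pyonmttok.Tokenizer
    translator: object with translate_batch(), such as ctranslate2.Translator
    """
    # tokenize() returns a (tokens, features) pair; only the tokens are needed.
    batch = [tokenizer.tokenize(sentence)[0] for sentence in sentences]
    results = translator.translate_batch(batch)

    translations = []
    for result in results:
        if hasattr(result, "hypotheses"):
            # Assumption: newer result objects list the best hypothesis first.
            tokens = result.hypotheses[0]
        else:
            # Result shape used in the example above.
            tokens = result[0]["tokens"]
        translations.append(tokenizer.detokenize(tokens))
    return translations
```

With `tokenizer` and `translator` created as in the example above, `translate_sentences(tokenizer, translator, ["Hello world!", "Good morning."])` should return a list with one translated string per input sentence.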