This repository contains code to translate datasets into multiple languages using the NLLB (No Language Left Behind) model. The model can be used:
- From a server to translate a single text or entire datasets.
- Using the `translate` script to translate a single text.
Note: This repository has been tested on a Linux machine with an NVIDIA GPU. The code assumes access to a GPU. Depending on your hardware, you might need to modify the code to change the number of GPUs and the batch size.
It is recommended to use a virtual environment to install the dependencies.
```bash
conda create -n instructmultilingual python=3.8.10 -y
conda activate instructmultilingual
pip install -r requirements.txt
```
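Before converting the model, you can optionally check that a GPU is visible. This is a minimal sketch that assumes PyTorch is among the installed dependencies:

```python
# Sanity check that a CUDA-capable GPU is visible.
# Assumes PyTorch is installed as part of requirements.txt.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```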
The inference server is a FastAPI application that can be used to translate a single text or entire datasets.
For efficient inference, the model is converted using CTranslate2:
```bash
mkdir models
ct2-transformers-converter --model facebook/nllb-200-3.3B --output_dir models/nllb-200-3.3B-converted
```
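For reference, the converted model can also be loaded directly with CTranslate2, outside of the server. The sketch below follows the standard CTranslate2 recipe for NLLB models; the language codes and example sentence are illustrative and not part of this repository's code:

```python
# Minimal sketch: translate one sentence with the converted model using CTranslate2.
import ctranslate2
from transformers import AutoTokenizer

translator = ctranslate2.Translator("models/nllb-200-3.3B-converted", device="cuda")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")

text = "Cohere For AI will make the best instruct multilingual model in the world"
source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

# NLLB expects the target language code as the first token of the output.
results = translator.translate_batch([source_tokens], target_prefix=[["arz_Arab"]])
target_tokens = results[0].hypotheses[0][1:]  # drop the language-code prefix
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))
```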
To start the server, run the following command:
```bash
uvicorn instructmultilingual.server:app --host 0.0.0.0 --port 8000
```
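Once the server is running, it can be queried at the `/translate` endpoint used throughout this repository. The request below is only a sketch: the payload field names are assumptions based on the CLI flags, not a definitive description of the server's schema:

```python
# Hypothetical request to the running inference server.
# The field names in the JSON payload are assumed from the CLI flags.
import requests

response = requests.post(
    "http://localhost:8000/translate",
    json={
        "text": "Cohere For AI will make the best instruct multilingual model in the world",
        "source_language": "English",
        "target_language": "Egyptian Arabic",
    },
)
print(response.json())
```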
To run the server with Docker, build the image and then run the container:
```bash
# Build
docker build -t instruct-multilingual .
# Run
docker run -it --rm --gpus 1 -p 8000:8000 -v $(pwd):/instruct-multilingual instruct-multilingual
```
The `main.py` script translates the SAMSum dataset using the inference server. It showcases how to use the server to translate both a single text and entire datasets.
```bash
python main.py
```
We also provide a script to translate a single text from the CLI. This script downloads the model from Hugging Face and translates the text provided into the target language.
```bash
python -m instructmultilingual.translate \
    --text="Cohere For AI will make the best instruct multilingual model in the world" \
    --source_language="English" \
    --target_language="Egyptian Arabic"
```
An example of using `translate_dataset_from_huggingface_hub` to translate PIQA with the `finish_sentence_with_correct_choice` template into the languages used by the Multilingual T5 (mT5) model:
```python
from instructmultilingual.translate_datasets import translate_dataset_from_huggingface_hub

translate_dataset_from_huggingface_hub(
    repo_id="bigscience/xP3",
    train_set=["en/xp3_piqa_None_train_finish_sentence_with_correct_choice.jsonl"],
    validation_set=["en/xp3_piqa_None_validation_finish_sentence_with_correct_choice.jsonl"],
    test_set=[],
    dataset_name="PIQA",
    template_name="finish_sentence_with_correct_choice",
    splits=["train", "validation"],
    translate_keys=["inputs", "targets"],
    url="http://localhost:8000/translate",
    output_dir="/home/weiyi/instruct-multilingual/datasets",
    source_language="English",
    checkpoint="facebook/nllb-200-3.3B",
)
```