A Simple Machine Translation System that utilizes a pre-trained model from HuggingFace, specifically the mBART-50 model.
- User Interface
- Rule-based Machine Translation
- Statistical Machine Translation
- Neural Machine Translation
- Deploy Model with FastAPI, Docker and Heroku
- Report and Slide
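For the Neural Machine Translation component above, here is a minimal sketch of Vietnamese-to-English translation with a pre-trained mBART-50 checkpoint from HuggingFace. The public `facebook/mbart-large-50-many-to-many-mmt` checkpoint is used purely for illustration; the checkpoint this repo actually loads may differ (e.g. a fine-tuned variant):

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Public multilingual checkpoint; an illustrative assumption, not necessarily the repo's
MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"

tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

tokenizer.src_lang = "vi_VN"  # source language code: Vietnamese
inputs = tokenizer("Tôi yêu những chú chó ngoan.", return_tensors="pt")

# Force the decoder to start generating in English
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```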
If you do not install Anaconda, our app may still work, but installing packages could be more difficult. Use a Python version between 3.7 and 3.10 (3.7 <= 3.x <= 3.10).
Conda:
$ conda create -n env python=3.xx anaconda
$ conda activate env
To leave the environment after you no longer need it:
$ conda deactivate
To check whether a package is available in the current channels, see: https://conda.anaconda.org/conda-forge/osx-arm64/
Mac/Linux Users:
$ python -m pip install --user --upgrade pip
$ python -m pip install --user virtualenv
$ python -m venv venv
To activate the virtual environment:
$ source venv/bin/activate
To leave the environment after you no longer need it:
$ deactivate
A Docker environment is recommended for installation:
$ docker-compose build dev
$ docker-compose run --rm dev
- All of the dependencies are listed in requirements.txt. To reduce the likelihood of environment errors, install the dependencies inside a virtual environment:
$ pip install -r requirements.txt
- requirements.txt:
numpy
matplotlib
tqdm
torch==2.0.0
torchvision==0.15.0
torchaudio==2.0.0
wrapt

# Rule-based Machine Translation
urbans

# Statistical Machine Translation
bs4
nltk==3.8.1

# mBART - Neural Machine Translation
fsspec==2023.9.2
datasets==2.14.6
sentencepiece==0.1.97
sacrebleu==2.3.1
transformers==4.26.1
protobuf==3.20.1

# Deployment
fastapi
pydantic
flask
uvicorn
Run the following command in terminal:
$ python train.py --model_name 'Transformer' \
    --device 'cpu' \
    --model_type 'unigram' \
    --src_lang 'vi' \
    --tgt_lang 'en' \
    --num_heads 8 \
    --num_layers 6 \
    --d_model 512 \
    --d_ff 2048 \
    --drop_out 0.1 \
    --seq_len 150 \
    --batch_size 16
Or
$ bash bash/transformer.sh
To launch our machine translation system, run the following command in terminal:
$ python app.py
We use the Translator class from the urbans library. For more details about URBANS, you can visit the URBANS GitHub page.
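As an illustration of how such a Translator is assembled, here is a minimal sketch with a toy grammar, grammar mapping, and dictionary (the values below are toy examples, not the grammar or dictionary shipped with this repo):

```python
from urbans import Translator

# Toy source-language grammar in NLTK CFG notation
src_grammar = """
    S -> NP VP
    NP -> PRP
    VP -> VB NP
    NP -> JJ NN
    PRP -> 'I'
    VB -> 'love'
    JJ -> 'good'
    NN -> 'dogs'
"""

# Reordering rules: in Vietnamese the adjective follows the noun
src_to_target_grammar = {
    "NP -> JJ NN": "NP -> NN JJ",
}

# Word-by-word source-to-target dictionary
en_to_vi_dict = {
    "I": "tôi",
    "love": "yêu",
    "good": "ngoan",
    "dogs": "chó",
}

translator = Translator(src_grammar=src_grammar,
                        src_to_tgt_grammar=src_to_target_grammar,
                        src_to_tgt_dictionary=en_to_vi_dict)

print(translator.translate("I love good dogs"))  # -> "tôi yêu chó ngoan"
```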
- The EVBCorpus - English-Vietnamese Parallel Corpus
- IBM Model 1 with the Expectation-Maximization (EM) algorithm (sketched below)
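For reference, here is a minimal sketch of IBM Model 1 training with EM. It is an illustrative toy implementation, not necessarily the one used in this repo:

```python
from collections import defaultdict

def train_ibm_model1(corpus, n_iters=10):
    """Estimate lexical translation probabilities t(f|e) with EM.

    corpus: iterable of (src_tokens, tgt_tokens) sentence pairs.
    A NULL token is prepended to each target sentence so that source
    words may align to nothing.
    """
    corpus = [(src, ["NULL"] + tgt) for src, tgt in corpus]
    t = defaultdict(lambda: 1.0)  # any uniform positive init works
    for _ in range(n_iters):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        # E-step: distribute each source token's alignment mass over the target words
        for src, tgt in corpus:
            for f in src:
                norm = sum(t[f, e] for e in tgt)
                for e in tgt:
                    delta = t[f, e] / norm
                    count[f, e] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts
        for (f, e), c in count.items():
            t[f, e] = c / total[e]
    return t

# Toy usage:
pairs = [("tôi yêu mèo".split(), "i love cats".split()),
         ("tôi yêu chó".split(), "i love dogs".split())]
t = train_ibm_model1(pairs)
print(t["mèo", "cats"])  # grows toward 1.0: "cats" is only ever explained by "mèo"
```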
The IWSLT'15 English-Vietnamese dataset from the Stanford NLP group is used. For all experiments, the corpus was split into training, development, and test sets:
Data set | Sentences | Download |
---|---|---|
Training | 133,317 | via GitHub, or located in data/train.en and data/train.vi |
Development | 1,553 | via GitHub, or located in data/validation.en and data/validation.vi |
Test | 1,268 | via GitHub, or located in data/test.en and data/test.vi |
- We utilize FastAPI (docs) to deploy our machine translation model. FastAPI is a modern web framework that has gained popularity for several reasons:
- High Performance
- Automatic API Documentation
- Type Annotations and Validation
- Asynchronous Support
- Dependency Injection System
- Security Features
- Easy Integration with Pydantic Models
from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from urbans import Translator

# Project-specific helpers defined elsewhere in this repo:
# process_grammar, src_to_target_grammar, en_to_vi_dict,
# smt_translate_vi2en, translate_vi2en

STATIC_FILES_DIR = './templates/'

app = FastAPI()

class InputModel(BaseModel):
    sentence: str
    grammar: str
    method: int

@app.get('/')
async def home():
    return FileResponse(
        STATIC_FILES_DIR + "home.html",
        headers={"Cache-Control": "no-cache"}
    )

@app.post('/predict')
def predict(input: InputModel):
    if input.method == 0:    # Rule-based Machine Translation
        translator = Translator(src_grammar=process_grammar(input.grammar),
                                src_to_tgt_grammar=src_to_target_grammar,
                                src_to_tgt_dictionary=en_to_vi_dict)
        language = translator.translate(input.sentence)
    elif input.method == 1:  # Statistical Machine Translation
        language = smt_translate_vi2en(input.sentence)
    elif input.method == 2:  # Neural Machine Translation
        language = translate_vi2en(input.sentence)
    return {
        "output": language
    }

# Mounted last so that the API routes above take precedence
app.mount("/", StaticFiles(directory=STATIC_FILES_DIR, html=True))
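Once the server is up (e.g. via `python app.py`), the /predict endpoint can be exercised as below. This sketch assumes the app listens on localhost:8000 (adjust to your setup) and uses the requests package, which is not in requirements.txt:

```python
import requests

# method: 0 = rule-based, 1 = statistical, 2 = neural (see predict() above)
resp = requests.post(
    "http://localhost:8000/predict",
    json={"sentence": "Tôi yêu những chú chó ngoan.", "grammar": "", "method": 2},
)
print(resp.json())  # {"output": "<English translation>"}
```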
Official repositories:
- VinAI Research BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (paper)
- Pytorch Beam Search
- Pytorch Documentation
Papers:
- Statistical Vs Rule Based Machine Translation: A Case Study on Indian Language Perspective
- Neural Machine Translation by Jointly Learning to Align and Translate
- Multilingual Denoising Pre-training for Neural Machine Translation
- BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Pre-Trained Models: Past, Present and Future
Tutorials:
- A GitHub Gist explaining how to set up README.md properly
- FastAPI in Containers - Docker
- Heroku: Deploying with Git
Colaboratory:
- Statistical Machine Translation: English to Hindi
Others: