A Simple Machine Translation System that utilizes a pre-trained model from HuggingFace, specifically the mBART-50 model.
- User Interface
- Rule-based Machine Translation
- Statistical Machine Translation
- Neural Machine Translation
- Deploy Model with FastAPI, Docker and Heroku
- Report and Slide
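For the Neural Machine Translation component above, here is a minimal sketch of Vietnamese-to-English translation with a pre-trained mBART-50 checkpoint from HuggingFace. The public `facebook/mbart-large-50-many-to-many-mmt` checkpoint is used purely for illustration; the checkpoint this repo actually loads may differ (e.g. a fine-tuned variant):

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Public multilingual checkpoint; an illustrative assumption, not necessarily the repo's
MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"

tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

tokenizer.src_lang = "vi_VN"  # source language code: Vietnamese
inputs = tokenizer("Tôi yêu những chú chó ngoan.", return_tensors="pt")

# Force the decoder to start generating in English
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```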
If you do not install Anaconda, our app may still work, but installing packages could be more difficult. Use a Python version between 3.7 and 3.10 (3.7 <= 3.x <= 3.10).
Conda:
$ conda create -n env python=3.xx anaconda
$ conda activate env
To leave the environment after you no longer need it:
$ conda deactivate
To check whether a package is available in the current channels, see: https://conda.anaconda.org/conda-forge/osx-arm64/
Mac/Linux Users:
$ python -m pip install --user --upgrade pip
$ python -m pip install --user virtualenv
$ python -m venv venv
To activate the virtual environment:
$ source venv/bin/activate
To leave the environment after you no longer need it:
$ deactivate
A Docker environment is recommended for installation:
$ docker-compose build dev
$ docker-compose run --rm dev
- All of the dependencies are listed in requirements.txt. To reduce the likelihood of environment errors, install the dependencies inside a virtual environment:
$ pip install -r requirements.txt
- requirements.txt:
numpy
matplotlib
tqdm
torch==2.0.0
torchvision==0.15.0
torchaudio==2.0.0
wrapt

# Rule-based Machine Translation
urbans

# Statistical Machine Translation
bs4
nltk==3.8.1

# mBART - Neural Machine Translation
fsspec==2023.9.2
datasets==2.14.6
sentencepiece==0.1.97
sacrebleu==2.3.1
transformers==4.26.1
protobuf==3.20.1

# Deployment
fastapi
pydantic
flask
uvicorn
Run the following command in terminal:
$ python train.py --model_name 'Transformer' \
    --device 'cpu' \
    --model_type 'unigram' \
    --src_lang 'vi' \
    --tgt_lang 'en' \
    --num_heads 8 \
    --num_layers 6 \
    --d_model 512 \
    --d_ff 2048 \
    --drop_out 0.1 \
    --seq_len 150 \
    --batch_size 16
Or
$ bash bash/transformer.sh
To launch our machine translation system, run the following command in terminal:
$ python app.py
We use the Translator class from the urbans library. For more details about URBANS, you can visit the URBANS GitHub page.
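As an illustration of how such a Translator is assembled, here is a minimal sketch with a toy grammar, grammar mapping, and dictionary (the values below are toy examples, not the grammar or dictionary shipped with this repo):

```python
from urbans import Translator

# Toy source-language grammar in NLTK CFG notation
src_grammar = """
    S -> NP VP
    NP -> PRP
    VP -> VB NP
    NP -> JJ NN
    PRP -> 'I'
    VB -> 'love'
    JJ -> 'good'
    NN -> 'dogs'
"""

# Reordering rules: in Vietnamese the adjective follows the noun
src_to_target_grammar = {
    "NP -> JJ NN": "NP -> NN JJ",
}

# Word-by-word source-to-target dictionary
en_to_vi_dict = {
    "I": "tôi",
    "love": "yêu",
    "good": "ngoan",
    "dogs": "chó",
}

translator = Translator(src_grammar=src_grammar,
                        src_to_tgt_grammar=src_to_target_grammar,
                        src_to_tgt_dictionary=en_to_vi_dict)

print(translator.translate("I love good dogs"))  # -> "tôi yêu chó ngoan"
```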
- The EVBCorpus - English-Vietnamese Parallel Corpus
- IBM Model 1 with the Expectation-Maximization (EM) algorithm (sketched below)
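For reference, here is a minimal sketch of IBM Model 1 training with EM. It is an illustrative toy implementation, not necessarily the one used in this repo:

```python
from collections import defaultdict

def train_ibm_model1(corpus, n_iters=10):
    """Estimate lexical translation probabilities t(f|e) with EM.

    corpus: iterable of (src_tokens, tgt_tokens) sentence pairs.
    A NULL token is prepended to each target sentence so that source
    words may align to nothing.
    """
    corpus = [(src, ["NULL"] + tgt) for src, tgt in corpus]
    t = defaultdict(lambda: 1.0)  # any uniform positive init works
    for _ in range(n_iters):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        # E-step: distribute each source token's alignment mass over the target words
        for src, tgt in corpus:
            for f in src:
                norm = sum(t[f, e] for e in tgt)
                for e in tgt:
                    delta = t[f, e] / norm
                    count[f, e] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts
        for (f, e), c in count.items():
            t[f, e] = c / total[e]
    return t

# Toy usage:
pairs = [("tôi yêu mèo".split(), "i love cats".split()),
         ("tôi yêu chó".split(), "i love dogs".split())]
t = train_ibm_model1(pairs)
print(t["mèo", "cats"])  # grows toward 1.0: "cats" is only ever explained by "mèo"
```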
The IWSLT'15 English-Vietnamese dataset from the Stanford NLP group is used. For all experiments, the corpus was split into training, development, and test sets:
Data set | Sentences | Download |
---|---|---|
Training | 133,317 | via GitHub, or located in data/train.en and data/train.vi |
Development | 1,553 | via GitHub, or located in data/validation.en and data/validation.vi |
Test | 1,268 | via GitHub, or located in data/test.en and data/test.vi |
- We utilize FastAPI (docs) to deploy our machine translation model. FastAPI is a modern web framework that has gained popularity for several reasons:
- High Performance
- Automatic API Documentation
- Type Annotations and Validation
- Asynchronous Support
- Dependency Injection System
- Security Features
- Easy Integration with Pydantic Models
from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from urbans import Translator

# Project-specific helpers defined elsewhere in this repo:
# process_grammar, src_to_target_grammar, en_to_vi_dict,
# smt_translate_vi2en, translate_vi2en

STATIC_FILES_DIR = './templates/'

app = FastAPI()

class InputModel(BaseModel):
    sentence: str
    grammar: str
    method: int

@app.get('/')
async def home():
    return FileResponse(
        STATIC_FILES_DIR + "home.html",
        headers={"Cache-Control": "no-cache"}
    )

@app.post('/predict')
def predict(input: InputModel):
    if input.method == 0:    # Rule-based Machine Translation
        translator = Translator(src_grammar=process_grammar(input.grammar),
                                src_to_tgt_grammar=src_to_target_grammar,
                                src_to_tgt_dictionary=en_to_vi_dict)
        language = translator.translate(input.sentence)
    elif input.method == 1:  # Statistical Machine Translation
        language = smt_translate_vi2en(input.sentence)
    elif input.method == 2:  # Neural Machine Translation
        language = translate_vi2en(input.sentence)
    return {
        "output": language
    }

# Mounted last so that the API routes above take precedence
app.mount("/", StaticFiles(directory=STATIC_FILES_DIR, html=True))
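Once the server is up (e.g. via `python app.py`), the /predict endpoint can be exercised as below. This sketch assumes the app listens on localhost:8000 (adjust to your setup) and uses the requests package, which is not in requirements.txt:

```python
import requests

# method: 0 = rule-based, 1 = statistical, 2 = neural (see predict() above)
resp = requests.post(
    "http://localhost:8000/predict",
    json={"sentence": "Tôi yêu những chú chó ngoan.", "grammar": "", "method": 2},
)
print(resp.json())  # {"output": "<English translation>"}
```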
Official repositories:
- VinAI Research BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (paper)
- Pytorch Beam Search
- Pytorch Documentation
Papers:
- Statistical Vs Rule Based Machine Translation: A Case Study on Indian Language Perspective
- Neural Machine Translation by Jointly Learning to Align and Translate
- Multilingual Denoising Pre-training for Neural Machine Translation
- BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Pre-Trained Models: Past, Present and Future
Tutorials:
- A GitHub Gist explaining how to set up README.md properly
- FastAPI in Containers - Docker
- Heroku: Deploying with Git
Colaboratory:
- Statistical Machine Translation: English to Hindi
Others: