This repository contains an implementation of a Transformer model for Neural Machine Translation (NMT), designed to translate text from English to Russian. The project uses TensorFlow for the model implementation, SentencePiece for subword tokenization, and PyGAD for hyperparameter optimization via a genetic algorithm. It also includes a Flask-based web interface for interactive translation.
## Table of Contents

- Overview
- Features
- Requirements
- Installation
- Dataset
- Usage
- Model Architecture
- Results
- Contributing
- License
## Overview

The NMT-Transformer project implements a Transformer-based model for translating English sentences to Russian, based on the architecture introduced in *Attention Is All You Need* (Vaswani et al., 2017). The codebase includes scripts for data preprocessing, model training, hyperparameter optimization, and a web interface for real-time translation. The model is trained on a dataset of English-Russian sentence pairs and uses subword tokenization to handle diverse vocabularies efficiently.
## Features

- Transformer Model: Encoder-decoder architecture with multi-head self-attention, positional encodings, and feed-forward networks.
- Subword Tokenization: Uses SentencePiece for efficient tokenization with a vocabulary size of 12,000.
- Custom Loss and Metrics: Masked sparse categorical cross-entropy and accuracy metrics to ignore padding and start tokens.
- Hyperparameter Optimization: Genetic algorithm (PyGAD) to optimize model hyperparameters like number of layers, model dimension, and dropout rate.
- Web Interface: Flask-based UI for translating English sentences to Russian with light/dark mode support.
- Jupyter Notebook: Example notebook for loading the model and performing translations.
## Requirements

- Python 3.8+
- TensorFlow 2.x
- SentencePiece
- PyGAD
- Flask
- NumPy
- pandas (for dataset loading)
- A GPU is recommended for faster training.
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/NikitaGoldashevsky/NMT-Transformer.git
  cd NMT-Transformer
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install tensorflow sentencepiece pygad flask numpy pandas
  ```

- Download or prepare the dataset (`rus_300000.csv`) and place it in the project root.
## Dataset

The project uses a dataset of English-Russian sentence pairs stored in `rus_300000.csv`. The file is expected to have two or three columns (e.g., English and Russian sentences, with an optional first column). The dataset is processed to create subword vocabularies using SentencePiece, with a maximum sequence length of 25 tokens.
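For orientation, tokenizer training with SentencePiece typically looks like the sketch below. The column layout, the shared English-Russian corpus file, and the `bpe` model prefix are assumptions based on this README, not the exact code in `Training_remote.py`.

```python
import pandas as pd
import sentencepiece as spm

# Load the sentence pairs (column layout assumed: the last two columns hold
# the English and Russian sentences).
df = pd.read_csv("rus_300000.csv", header=None)
en_sentences = df.iloc[:, -2].astype(str)
ru_sentences = df.iloc[:, -1].astype(str)

# SentencePiece trains from a plain-text file with one sentence per line.
with open("corpus.txt", "w", encoding="utf-8") as f:
    for line in list(en_sentences) + list(ru_sentences):
        f.write(line + "\n")

# Train a joint BPE model; this produces bpe.model and bpe.vocab.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe",
    vocab_size=12000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("What is your favorite color?", out_type=int)[:25])  # max_len = 25
```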
To use your own dataset:
- Prepare a CSV file with English and Russian sentence pairs.
- Update the `ds_name` variable in `Training_remote.py` to point to your dataset.
## Usage

### Training

Run the training script:

```bash
python Training_remote.py
```

This script:

- Loads and preprocesses the dataset.
- Trains a SentencePiece tokenizer.
- Builds and trains the Transformer model for 8+2 epochs.
- Prints sample translations after each epoch using a callback (see the sketch below).
- Saves the trained model (`my_transformer_model.keras`) and tokenizer files (`bpe.model`, `bpe.vocab`).
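The per-epoch translation callback could look roughly like the following sketch. It assumes a `decode_sequence` helper (as used in the notebook below); the class and argument names are illustrative and may differ from `Training_remote.py`.

```python
import tensorflow as tf

class SampleTranslationCallback(tf.keras.callbacks.Callback):
    """Illustrative callback that prints a few sample translations after each epoch."""

    def __init__(self, sentences, decode_fn):
        super().__init__()
        self.sentences = sentences  # English sentences to translate
        self.decode_fn = decode_fn  # e.g. decode_sequence from this project

    def on_epoch_end(self, epoch, logs=None):
        print(f"\nSample translations after epoch {epoch + 1}:")
        for sentence in self.sentences:
            print(f"  {sentence} -> {self.decode_fn(sentence)}")

# Assumed usage:
# model.fit(train_ds, epochs=8, callbacks=[SampleTranslationCallback(samples, decode_sequence)])
```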
### Hyperparameter Optimization

The training script includes a genetic algorithm to optimize hyperparameters:

- Run the optimization section in `Training_remote.py` (requires PyGAD).
- The algorithm tests combinations of `num_layers`, `d_model`, `num_heads`, `d_ff`, `dropout_rate`, `learning_rate`, and `batch_size`.
- Results are printed, including the best hyperparameters and validation accuracy.
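For reference, a search of this kind typically looks like the sketch below (assuming PyGAD 3.x). The gene ranges and the `build_and_train_model` helper are placeholders, not the exact values or code in `Training_remote.py`.

```python
import pygad

# Candidate values per gene (placeholders; the real search space may differ).
gene_space = [
    [1, 2, 4],                    # num_layers
    [128, 256, 384, 512],         # d_model
    [4, 8],                       # num_heads
    [512, 1024, 2048],            # d_ff
    {"low": 0.0, "high": 0.3},    # dropout_rate
    {"low": 1e-4, "high": 1e-3},  # learning_rate
    [32, 64, 128],                # batch_size
]

def build_and_train_model(num_layers, d_model, num_heads, d_ff, dropout_rate,
                          learning_rate, batch_size):
    # Placeholder: the real function would build the Transformer with these
    # hyperparameters, train briefly, and return the validation accuracy.
    return 0.0

def fitness_func(ga_instance, solution, solution_idx):
    # PyGAD maximizes the fitness value, so return validation accuracy directly.
    return build_and_train_model(*solution)

ga = pygad.GA(
    num_generations=5,
    num_parents_mating=2,
    sol_per_pop=6,
    num_genes=len(gene_space),
    gene_space=gene_space,
    fitness_func=fitness_func,
)
ga.run()
solution, fitness, _ = ga.best_solution()
print("Best hyperparameters:", solution, "validation accuracy:", fitness)
```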
### Web Interface

- Ensure the trained model (`my_transformer_model_subword_bugfixed.keras`) and tokenizer (`bpe_subword_bugfixed.model`) are in the project root.
- Run the Flask app:

  ```bash
  python Flask_Interface.py
  ```

- Open a browser and navigate to `http://127.0.0.1:5000`.
- Enter an English sentence and click "Translate" to see the Russian translation and inference time.
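A minimal sketch of what such an endpoint might look like, assuming the project's `decode_sequence` helper and a hypothetical `index.html` template; the actual `Flask_Interface.py` may be structured differently:

```python
import time
from flask import Flask, request, render_template
# from Training_remote import decode_sequence  # assumed import; module layout may differ

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    # Hypothetical route: translate the submitted text and report inference time.
    translation, elapsed = "", None
    if request.method == "POST":
        text = request.form.get("text", "")
        start = time.time()
        translation = decode_sequence(text)  # the project's translation helper
        elapsed = round(time.time() - start, 3)
    return render_template("index.html", translation=translation, elapsed=elapsed)

if __name__ == "__main__":
    app.run(debug=True)
```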
### Jupyter Notebook

- Open `Importing_Transformer.ipynb` in Jupyter.
- Run the cells under the "Subword tokenization" section to load the model and tokenizer.
- Test translations with example sentences or your own inputs.

Example:

```python
print(decode_sequence("What are you going to do this morning?"))
# Output: Что вы будете делать сегодня утром?
```
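The README does not show `decode_sequence` itself; conceptually, a greedy decoder for this kind of encoder-decoder model looks roughly like the sketch below. The start/end token IDs, padding ID, and the model's input layout are assumptions, so treat this as an illustration rather than the project's exact implementation.

```python
import numpy as np

def greedy_decode(model, sp, sentence, max_len=25, pad_id=0, start_id=1, end_id=2):
    """Illustrative greedy decoding loop: feed the growing target back into the model."""
    src_ids = sp.encode(sentence, out_type=int)[:max_len]
    src = np.array([src_ids + [pad_id] * (max_len - len(src_ids))])
    target = [start_id]
    for _ in range(max_len - 1):
        tgt = np.array([target + [pad_id] * (max_len - len(target))])
        logits = model.predict([src, tgt], verbose=0)  # assumed (encoder, decoder) input order
        next_id = int(np.argmax(logits[0, len(target) - 1]))
        if next_id == end_id:
            break
        target.append(next_id)
    return sp.decode(target[1:])

# Assumed usage (imports and custom_objects omitted for brevity):
# model = tf.keras.models.load_model("my_transformer_model_subword_bugfixed.keras")
# sp = spm.SentencePieceProcessor(model_file="bpe_subword_bugfixed.model")
# print(greedy_decode(model, sp, "She loves to play soccer."))
```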
## Model Architecture

The Transformer model consists of:

- Encoder: Processes English input with multi-head self-attention, positional encodings, and feed-forward networks.
- Decoder: Generates Russian output with masked self-attention, cross-attention to encoder outputs, and feed-forward networks.
- Hyperparameters:
  - `d_model`: 384
  - `num_heads`: 8
  - `num_layers`: 1
  - `d_ff`: 512
  - `dropout_rate`: 0.153
  - `max_len`: 25
  - `vocab_size`: 12,000
- Custom Layers: `PositionalEncoding` for sequence position information and `PaddingMask` for ignoring padding tokens.
- Loss: Masked sparse categorical cross-entropy.
- Optimizer: Adam with an exponentially decaying learning rate.
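A minimal sketch of the masked loss and the decaying learning rate described above; the padding token ID and the decay constants are assumptions, not values taken from the project code.

```python
import tensorflow as tf

def masked_loss(y_true, y_pred, pad_id=0):
    # Masked sparse categorical cross-entropy: positions holding the padding
    # token contribute nothing to the averaged loss.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="none")
    per_token_loss = loss_fn(y_true, y_pred)
    mask = tf.cast(tf.not_equal(y_true, pad_id), per_token_loss.dtype)
    return tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)

# Adam with an exponentially decaying learning rate (decay constants assumed).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```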
## Results

The model achieves reasonable translations for short English sentences, as shown in `Importing_Transformer.ipynb`. Example translations:
- "She loves to play soccer." → "Она любит играть в футбол."
- "What is your favorite color?" → "Какой твой любимый цвет?"
- "Can you help me with my homework?" → "Можешь помочь мне с домашним заданием?"

To evaluate performance quantitatively, consider computing BLEU scores with a library such as `sacrebleu`.
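For example, corpus-level BLEU with `sacrebleu` (not part of this repository) can be computed as follows:

```python
import sacrebleu

# Model outputs and reference translations, one string per sentence.
hypotheses = ["Она любит играть в футбол."]
references = [["Она любит играть в футбол."]]  # outer list: one entry per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```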
## Contributing

Contributions are welcome! Please:

- Fork the repository.
- Create a feature branch (`git checkout -b feature-name`).
- Commit your changes (`git commit -m "Add feature"`).
- Push to the branch (`git push origin feature-name`).
- Open a pull request.

Suggestions for improvement:
- Add BLEU score evaluation.
- Support additional language pairs.
- Enhance the genetic algorithm with more generations or a larger population.
## License

This project is licensed under the MIT License. See the LICENSE file for details.