MLTRANSLATOR-PY

Project Description

This project aims to create a neural machine translation model for translating text between Croatian and English. The project encompasses data collection and preprocessing, exploratory data analysis, model selection, implementation, training, evaluation, and comparison.

Project Structure

Introduction
- Problem Description: This project addresses machine translation between Croatian and English using neural networks. Machine translation helps reduce language barriers and enables communication between speakers of different languages.
- Project Goals: The goal is to develop a robust and efficient translation model that can accurately translate texts between Croatian and English.
Data Collection and Preprocessing
- Data Collection: Data was collected from parallel corpora of Croatian and English texts, including the en-hr language pair of opus100.
- Data Cleaning: The data has been preprocessed to handle missing values, encode categorical variables, and normalize textual data.
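As an illustration of the cleaning step, here is a minimal sketch of filtering parallel sentence pairs. It is not the code in data/download_data.py: the function name `clean_pairs`, the `max_ratio` threshold, and the sample pairs are all assumptions made for this example.

```python
import unicodedata

def clean_pairs(pairs, max_ratio=3.0):
    """Normalize and filter (source, target) sentence pairs (hypothetical helper)."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Treat None as a missing value and skip the pair.
        if src is None or tgt is None:
            continue
        # NFC-normalize so characters such as "č" have a single encoding,
        # then collapse runs of whitespace.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        if not src or not tgt:
            continue
        # Drop pairs whose lengths differ wildly; they are often misaligned.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        # Drop exact duplicates while keeping the original order.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned

# Made-up sample data for demonstration only.
pairs = [
    ("Dobar  dan.", "Good day."),
    ("Dobar dan.", "Good day."),  # duplicate once whitespace is collapsed
    ("", "Empty source"),
    (None, "Missing source"),
    ("Da.", "Yes, that is absolutely and completely correct."),
    ("Hvala lijepa.", "Thank you very much."),
]
print(clean_pairs(pairs))
```

The length-ratio filter is a common heuristic for parallel corpora; the right threshold depends on the data and would need tuning.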
Feature Analysis
- Exploratory Data Analysis: The data was visualized to understand the distribution of text lengths, common words, and correlations.
- Feature Selection: Features such as sentence length and word frequency were selected based on the analysis.
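The two features named above, sentence length and word frequency, can be computed with the standard library alone. The sketch below is an assumption about how such an analysis might look, not the project's own code; `length_stats`, `word_counts`, and the sample sentences are hypothetical.

```python
from collections import Counter
from statistics import mean, median

def length_stats(sentences):
    """Summarize sentence lengths, measured in whitespace-separated tokens."""
    lengths = [len(s.split()) for s in sentences]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}

def word_counts(sentences):
    """Count lowercase word frequencies, ignoring surrounding punctuation."""
    words = (w.strip(".,!?;:\"'").lower() for s in sentences for w in s.split())
    return Counter(w for w in words if w)

# Made-up sample sentences for demonstration only.
sample = [
    "The cat sleeps.",
    "The dog barks loudly.",
    "A cat and a dog.",
]
print(length_stats(sample))
print(word_counts(sample).most_common(3))
```

Whitespace tokenization is a simplification; a subword tokenizer would give different length statistics.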
Model Selection and Implementation
- Model Selection: Several models were considered, including mBART (the MBartForConditionalGeneration class in Hugging Face Transformers). The selection was based on their performance in similar translation tasks.
- Model Implementation: Selected models were implemented using Hugging Face Transformers and PyTorch libraries.
- Cross-Validation: Cross-validation was conducted to ensure the robustness of the model.
Model Evaluation
- Performance Evaluation: Metrics such as BLEU score were used to evaluate the model's performance.
- Model Comparison: The performance of different models was compared, and the best model was selected for implementation.
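To make the BLEU metric concrete, here is a simplified sketch of corpus-level BLEU with clipped n-gram precision and a brevity penalty. It assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so its numbers will not match those of an established implementation such as sacreBLEU. It is not the code in model/evaluate.py, and `corpus_bleu` is a name chosen for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-1 scale (one reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

score = corpus_bleu(["the cat sat on a mat"], ["the cat sat on the mat"])
print(f"BLEU: {score:.3f}")
```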

Setup Instructions

Prerequisites

Ensure that CUDA 11.8 is installed.
Ensure that cuDNN for CUDA 11.8 is installed.

Setup Steps

Clone the repository:

```
git clone https://github.com/ILISJAK/mltranslator-py.git
cd mltranslator-py
```

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Install required packages:
```
pip install -r requirements.txt
```
Check the CUDA installation. Verify that the CUDA toolkit is installed and on your PATH by running:
```
nvcc --version
```
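Additionally, verify from a Python session that PyTorch can access the GPU:
```
import torch
# True if this PyTorch build can use cuDNN
print(torch.backends.cudnn.is_available())
# True if PyTorch can see a CUDA-capable GPU
print(torch.cuda.is_available())
```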
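If the check reports that no GPU is available, scripts can still fall back to the CPU. The helper below is a hypothetical sketch of that pattern; `pick_device` is not part of this repository, and the training script may handle device selection differently.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu" (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")
```

A model or tensor can then be moved with `.to(device)`. Training on the CPU is likely to be much slower.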
Download and preprocess data:
```
python data/download_data.py
```
Train the model:
```
python model/train.py
```
Evaluate the model:
```
python model/evaluate.py
```
Run the web interface:
```
cd web
python app.py
```
Open your browser and navigate to http://127.0.0.1:5000 to use the translation interface.

Notes

Ensure that you have the appropriate drivers and libraries for your GPU.
Adjust batch sizes and datasets as needed to fit the capacity of your GPU memory.
The evaluation script plots the BLEU score to visualize model performance.
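One way to find a batch size that fits in GPU memory is to halve it after each out-of-memory error. The sketch below shows this under stated assumptions: `run_with_backoff` and `fake_step` are hypothetical names, and the simulated error stands in for the RuntimeError that PyTorch raises when CUDA memory is exhausted. It is not taken from model/train.py.

```python
def run_with_backoff(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each out-of-memory error."""
    while batch_size >= min_batch_size:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            # Only retry on memory exhaustion; re-raise any other error.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError("No batch size fits in memory")

def fake_step(batch_size):
    # Stand-in for one training step; pretends only batches of 8 or fewer fit.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"trained on batch of {batch_size}"

print(run_with_backoff(fake_step, batch_size=32))
```

In a real training loop, the step function would run a forward and backward pass. A smaller batch size may also call for adjusting the learning rate or using gradient accumulation.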

Additional Information

Data Cleaning and Preprocessing: The data/download_data.py script downloads and preprocesses the necessary data.