An educational application that evaluates and compares different Large Language Models (LLMs) on mathematical problem-solving tasks. The project implements a complete pipeline for training, evaluation, and comparison of LLaMA, Qwen, and Mistral models.
## Features

This project provides:
- Custom training pipeline for fine-tuning LLMs on mathematical problems
- Comprehensive evaluation framework for model comparison
- Memory-optimized implementation for resource-constrained environments
- Visualization tools for performance analysis
## Models

- **LLaMA 3.2 3B Instruct** - Efficient 3B parameter model
- **Qwen 7B Instruct** - High-performance 7B parameter model
- **Mistral 7B Instruct** - Advanced 7B parameter model
## Dataset

- School Math Questions
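As a minimal sketch of how a school-math dataset might be loaded with the HuggingFace `datasets` library (the dataset identifier and field names below are placeholders, not the project's actual source):

```python
from datasets import load_dataset

# Placeholder identifier: substitute the actual school-math dataset this project uses
dataset = load_dataset("your-namespace/school-math-questions", split="train")

# Each example is assumed to pair a question with a worked answer
print(dataset[0])  # e.g. {"question": "...", "answer": "..."}
```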
## Project Structure

```
.
├── test_dataset.py    # Model evaluation implementation
├── testing_llms.py    # Training and fine-tuning pipeline
├── llms_visuals.py    # Visualization utilities
├── requirements.txt   # Project dependencies
└── README.md          # This file
```
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/educational-llm-math-solver.git
  cd educational-llm-math-solver
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download models:

  ```bash
  # Run the download script (if provided) or manually download from HuggingFace
  python download_models.py
  ```

## Usage

### Training

```python
from testing_llms import LocalModelTrainer

# Initialize trainer
trainer = LocalModelTrainer(
    model_path="path_to_model",
    num_train_examples=50
)

# Train and evaluate
trainer.train_and_evaluate()
```

### Evaluation

```python
from test_dataset import ModelEvaluator

# Initialize evaluator
evaluator = ModelEvaluator([
    "checkpoints/epoch_10_llama",
    "checkpoints/epoch_10_qwen",
    "checkpoints/epoch_10_mistral"
])

# Run evaluation
evaluator.evaluate_all_checkpoints()
```

### Visualization

```python
from llms_visuals import visualize_results

# Generate visualization
visualize_results()
```

### Web App

```bash
streamlit run app.py
```
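Note that `app.py` does not appear in the project structure above. As a minimal sketch of what such a Streamlit entry point could contain, using the scores from the Results table below (the layout and hard-coded data are assumptions, not the project's actual app):

```python
import streamlit as st
import pandas as pd

st.title("Educational LLM Math Solver")
st.subheader("Model Performance Comparison")

# Scores copied from the Results table; a real app would load them from evaluation output
results = pd.DataFrame({
    "Model": ["Qwen-7B-Instruct", "LLaMA-3.2-3B-Instruct", "Mistral-7B-Instruct"],
    "F1 Score": [0.6255, 0.5371, 0.2531],
})

st.table(results)
```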
## Results

Model performance comparison (F1 scores):

| Model | F1 Score |
|---|---|
| Qwen-7B-Instruct | 0.6255 |
| LLaMA-3.2-3B-Instruct | 0.5371 |
| Mistral-7B-Instruct | 0.2531 |
## Key Features

- **Memory Optimization**
  - Gradient checkpointing
  - Dynamic memory cleanup (see the sketch in Technical Details below)
  - Device-specific optimizations
- **Evaluation Metrics**
  - F1 score calculation (see the sketch after this list)
  - Response accuracy assessment
  - Processing speed monitoring
- **Visualization Tools**
  - Training progress plots
  - Model comparison charts (see the sketch after this list)
  - Performance metrics visualization
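The project's F1 implementation lives in `test_dataset.py` and is not reproduced here. A common approach for free-form answers is token-level F1 between the model's response and the reference answer; the helper below is a sketch under that assumption (`token_f1` is an illustrative name, not the project's API):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model response and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: tokens common to prediction and reference
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the answer is 42", "42"))  # 0.4
```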
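Similarly, a model comparison chart of the kind `llms_visuals.py` produces could be sketched with matplotlib as follows (the styling and output path are assumptions; the scores come from the Results table above):

```python
import matplotlib.pyplot as plt

# F1 scores from the Results section
models = ["Qwen-7B-Instruct", "LLaMA-3.2-3B-Instruct", "Mistral-7B-Instruct"]
f1_scores = [0.6255, 0.5371, 0.2531]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(models, f1_scores)
ax.set_ylabel("F1 Score")
ax.set_ylim(0, 1)
ax.set_title("Model Performance Comparison")
plt.tight_layout()
plt.savefig("model_comparison.png")
```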
## Requirements

- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- Additional requirements in `requirements.txt` (a plausible sketch follows)
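The exact pins live in the repository's `requirements.txt`; a plausible set consistent with the stack this README describes (PyTorch, Transformers, HuggingFace hosting, Streamlit, plotting) might be:

```text
torch>=2.0
transformers>=4.30
datasets
streamlit
matplotlib
pandas
```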
## Technical Details

The implementation includes several optimization techniques:

```python
import torch
from torch.utils.data import DataLoader

# Memory management: cap the MPS memory pool on Apple Silicon
if hasattr(torch.mps, 'set_per_process_memory_fraction'):
    torch.mps.set_per_process_memory_fraction(0.7)

# Gradient checkpointing: trade recomputation for lower activation memory
model.gradient_checkpointing_enable()

# Batch size optimization: batch_size=1 keeps peak memory low
train_dataloader = DataLoader(
    train_dataset,
    batch_size=1,
    shuffle=True,
    num_workers=0
)
```
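The dynamic memory cleanup listed under Key Features is not shown in the snippet above; a common pattern, assuming cleanup runs between training or evaluation passes, looks like this (the helper name is illustrative):

```python
import gc
import torch

def cleanup_memory() -> None:
    """Free cached accelerator memory between training or evaluation runs."""
    gc.collect()  # Drop unreachable Python objects first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # Release cached CUDA blocks back to the driver
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()  # Same idea for Apple Silicon (MPS)
```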
## Contributing

- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- HuggingFace for model and dataset hosting
- PyTorch team for the deep learning framework
- Original model creators (LLaMA, Qwen, Mistral)