This repository contains a Multi-Modal Retrieval-Augmented Generation (RAG) pipeline that processes both images and text locally. It combines CLIP image embeddings, a Qdrant vector store, and LlamaIndex indexing into a single information retrieval and generation system.
- Local processing of images and text
- Integration with Qdrant vector store
- CLIP image embedding for efficient image retrieval
- Multi-modal index creation using LlamaIndex
- Interactive query system with both text and image results
1. Clone the repository:

   ```bash
   git clone https://github.com/naimkatiman/Multi-Modal-RAG-Pipeline-on-Images-and-Text-Locally.git
   cd Multi-Modal-RAG-Pipeline-on-Images-and-Text-Locally
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Prepare your data (a pairing-check sketch follows this list):
   - Place your image files (.jpg or .png) in the `data` directory
   - Ensure a corresponding text file (.txt) with the same name exists for each image
4. Run the pipeline notebook:

   ```bash
   jupyter notebook myRAG.ipynb
   ```
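Before running the notebook, you can sanity-check the data layout with a small script like the one below. This is a hypothetical helper, not part of the repository; it only assumes the `data` directory and the image/text pairing convention described in step 3.

```python
# Hypothetical helper: verify every image in ./data has a matching .txt caption.
from pathlib import Path

DATA_DIR = Path("data")  # directory name taken from step 3 above

def find_pairs(data_dir: Path) -> list[tuple[Path, Path]]:
    """Return (image, text) pairs; warn about images missing a caption file."""
    pairs = []
    for image in sorted(data_dir.iterdir()):
        if image.suffix.lower() not in {".jpg", ".png"}:
            continue
        text = image.with_suffix(".txt")
        if text.exists():
            pairs.append((image, text))
        else:
            print(f"Warning: no caption file for {image.name}")
    return pairs

if __name__ == "__main__":
    print(f"Found {len(find_pairs(DATA_DIR))} image-text pairs")
```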
The pipeline works in four stages:

1. **Data Preparation**: The system scans the specified directory for image-text pairs.
2. **Index Creation**: A multi-modal index is created using LlamaIndex, storing both text and image embeddings (see the sketch after this list).
3. **Query Processing**: Users can input queries, and the system retrieves relevant text and images.
4. **Visualization**: Retrieved images are displayed using matplotlib.
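For concreteness, the index-creation stage typically looks like the following sketch, modeled on LlamaIndex's multi-modal Qdrant example. The collection names and storage path here are assumptions, and exact import paths vary between llama-index versions; the repository's actual code lives in myRAG.ipynb.

```python
# Sketch of the index-creation stage (collection names and paths are assumptions).
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Local, on-disk Qdrant instance -- no server required.
client = qdrant_client.QdrantClient(path="./qdrant_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load the image-text pairs and build a joint text + image (CLIP) index.
documents = SimpleDirectoryReader("./data").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```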
Here's a simple visualization of the pipeline:

```mermaid
graph TD
    A[Data Preparation] --> B[Index Creation]
    B --> C[Query Processing]
    C --> D[Result Visualization]
    style A fill:#f9d5e5,stroke:#333,stroke-width:2px
    style B fill:#eeac99,stroke:#333,stroke-width:2px
    style C fill:#e06377,stroke:#333,stroke-width:2px
    style D fill:#c83349,stroke:#333,stroke-width:2px
```
Example queries return both matching text passages and images.
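As a hedged illustration (the query string below is invented, and the top-k values stand in for the configured `TOP_K`), issuing a query against the `index` built in the sketch above and visualizing the hits might look like this:

```python
# Illustrative query + visualization; query text and top-k values are invented.
import matplotlib.pyplot as plt
from PIL import Image
from llama_index.core.schema import ImageNode

retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
results = retriever.retrieve("a red vintage car parked on a city street")

for result in results:
    if isinstance(result.node, ImageNode):
        # Retrieved images are displayed with matplotlib, as described above.
        plt.imshow(Image.open(result.node.image_path))
        plt.axis("off")
        plt.show()
    else:
        # Text hits are printed alongside their retrieval scores.
        print(result.score, result.node.get_content())
```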
You can customize the pipeline by modifying the following parameters in `config.py`:

- `DATA_PATH`: Path to your image and text data
- `QDRANT_PATH`: Path for local Qdrant storage
- `TOP_K`: Number of results to retrieve for each query
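A plausible `config.py` matching those parameters might look like the following; the values shown are illustrative defaults, not the repository's actual settings:

```python
# Hypothetical config.py -- values are illustrative, not the repo's defaults.
DATA_PATH = "./data"         # path to your image and text data
QDRANT_PATH = "./qdrant_db"  # path for local Qdrant storage
TOP_K = 3                    # number of results to retrieve per query
```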
The Multi-Modal RAG Pipeline offers:
- Fast retrieval times (typically <100ms)
- High accuracy in matching relevant images and text
- Scalability to handle large datasets
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with:

- LlamaIndex for the indexing framework
- Qdrant for the vector database
- CLIP for image embeddings