This repository contains a Multi-Modal Retrieval-Augmented Generation (RAG) pipeline that processes both images and text locally. It combines CLIP image embeddings, a Qdrant vector store, and LlamaIndex indexing into a single information retrieval and generation system.
- Local processing of images and text
- Integration with Qdrant vector store
- CLIP image embedding for efficient image retrieval
- Multi-modal index creation using LlamaIndex
- Interactive query system with both text and image results
1. Clone the repository:

   ```bash
   git clone https://github.com/naimkatiman/Multi-Modal-RAG-Pipeline-on-Images-and-Text-Locally.git
   cd Multi-Modal-RAG-Pipeline-on-Images-and-Text-Locally
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Prepare your data (a pairing-check sketch follows this list):
   - Place your image files (.jpg or .png) in the `data` directory
   - Ensure a corresponding text file (.txt) with the same name exists for each image
4. Run the pipeline notebook:

   ```bash
   jupyter notebook myRAG.ipynb
   ```
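Before running the notebook, you can sanity-check the data layout with a small script like the one below. This is a hypothetical helper, not part of the repository; it only assumes the `data` directory and the image/text pairing convention described in step 3.

```python
# Hypothetical helper: verify every image in ./data has a matching .txt caption.
from pathlib import Path

DATA_DIR = Path("data")  # directory name taken from step 3 above

def find_pairs(data_dir: Path) -> list[tuple[Path, Path]]:
    """Return (image, text) pairs; warn about images missing a caption file."""
    pairs = []
    for image in sorted(data_dir.iterdir()):
        if image.suffix.lower() not in {".jpg", ".png"}:
            continue
        text = image.with_suffix(".txt")
        if text.exists():
            pairs.append((image, text))
        else:
            print(f"Warning: no caption file for {image.name}")
    return pairs

if __name__ == "__main__":
    print(f"Found {len(find_pairs(DATA_DIR))} image-text pairs")
```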
The pipeline works in four stages:

1. **Data Preparation**: The system scans the specified directory for image-text pairs.
2. **Index Creation**: A multi-modal index is created using LlamaIndex, storing both text and image embeddings (see the sketch after this list).
3. **Query Processing**: Users can input queries, and the system retrieves relevant text and images.
4. **Visualization**: Retrieved images are displayed using matplotlib.
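For concreteness, the index-creation stage typically looks like the following sketch, modeled on LlamaIndex's multi-modal Qdrant example. The collection names and storage path here are assumptions, and exact import paths vary between llama-index versions; the repository's actual code lives in myRAG.ipynb.

```python
# Sketch of the index-creation stage (collection names and paths are assumptions).
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Local, on-disk Qdrant instance -- no server required.
client = qdrant_client.QdrantClient(path="./qdrant_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load the image-text pairs and build a joint text + image (CLIP) index.
documents = SimpleDirectoryReader("./data").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```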
Here's a simple visualization of the pipeline:

```mermaid
graph TD
    A[Data Preparation] --> B[Index Creation]
    B --> C[Query Processing]
    C --> D[Result Visualization]
    style A fill:#f9d5e5,stroke:#333,stroke-width:2px
    style B fill:#eeac99,stroke:#333,stroke-width:2px
    style C fill:#e06377,stroke:#333,stroke-width:2px
    style D fill:#c83349,stroke:#333,stroke-width:2px
```
Example queries return both matching text passages and images.
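As a hedged illustration (the query string below is invented, and the top-k values stand in for the configured `TOP_K`), issuing a query against the `index` built in the sketch above and visualizing the hits might look like this:

```python
# Illustrative query + visualization; query text and top-k values are invented.
import matplotlib.pyplot as plt
from PIL import Image
from llama_index.core.schema import ImageNode

retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
results = retriever.retrieve("a red vintage car parked on a city street")

for result in results:
    if isinstance(result.node, ImageNode):
        # Retrieved images are displayed with matplotlib, as described above.
        plt.imshow(Image.open(result.node.image_path))
        plt.axis("off")
        plt.show()
    else:
        # Text hits are printed alongside their retrieval scores.
        print(result.score, result.node.get_content())
```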
You can customize the pipeline by modifying the following parameters in `config.py`:

- `DATA_PATH`: Path to your image and text data
- `QDRANT_PATH`: Path for local Qdrant storage
- `TOP_K`: Number of results to retrieve for each query
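A plausible `config.py` matching those parameters might look like the following; the values shown are illustrative defaults, not the repository's actual settings:

```python
# Hypothetical config.py -- values are illustrative, not the repo's defaults.
DATA_PATH = "./data"         # path to your image and text data
QDRANT_PATH = "./qdrant_db"  # path for local Qdrant storage
TOP_K = 3                    # number of results to retrieve per query
```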
The Multi-Modal RAG Pipeline offers:
- Fast retrieval times (typically <100ms)
- High accuracy in matching relevant images and text
- Scalability to handle large datasets
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with:

- LlamaIndex for the indexing framework
- Qdrant for the vector database
- CLIP for image embeddings