
Using RAG to maximize the accuracy of retrieving relevant data from the documents of public institutions.

Primary Language: Python

Zindi AI RAG Documentation


This project provides two main scripts: one for managing a vector database of document embeddings, and one for querying that database using language models.

Directory Structure

├── data
│   ├── document1.pdf
│   ├── document2.docx
│   └── ...
├── test_files
│   ├── contexts.csv
│   └── ...
├── tools
│   ├── utils.py
│   └── ...
├── main.py
├── zindi.py
├── requirements.txt
└── README.md


Ensure you have Python installed on your machine. The project is tested with Python 3.7 and above. You will need the following packages:

  • langchain
  • langchain_community
  • langchain_core
  • faiss-cpu for FAISS (use faiss-gpu if GPU is available)


  • Create a .env file in the root directory of the project and provide your API keys:
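
As an illustration of how such a file could be read at startup, here is a minimal sketch that uses only the standard library. The function name `load_env_file` and the key name in the usage note are hypothetical; the project itself may rely on a dedicated package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Hypothetical helper: returns a dict of the values found in the file.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
    return loaded
```

After calling `load_env_file()`, a key stored in the file under a placeholder name such as `ZINDI_RAG_EXAMPLE_KEY` would be available through `os.environ`.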


  1. Clone the repository:

    git clone <repository-url>
    cd <repository-name>
  2. Install required Python packages:

    pip install -r requirements.txt
  3. Launch the Streamlit app:

    streamlit run app.py


Step 1: Ingest Documents

Place all the documents (PDF and DOCX formats) that you want to ingest into the vector database inside the data folder. The ingestion script will process these documents and populate the vector database.
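
To check which files in the data folder would qualify before running the ingestion, a small sketch like the following can help. It assumes that files are selected by extension only; `list_ingestable_documents` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# Extensions taken from the formats named above (PDF and DOCX).
SUPPORTED_SUFFIXES = {".pdf", ".docx"}


def list_ingestable_documents(data_dir="data"):
    """Return the PDF and DOCX files found directly inside data_dir, sorted."""
    folder = Path(data_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    for path in list_ingestable_documents():
        print(path.name)
```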

Step 2: Execute the Ingestion Script

Run the following command to ingest documents and create the vector database:

python ingestion.py

This script will:

  • Check if the vector database already exists. If not, it will ingest documents from the data folder.
  • Initialize the embedding model using HuggingFace embeddings.
  • Load the vector database using FAISS.
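
The three steps above can be sketched roughly as follows. This is not the project's actual ingestion script: the index folder `vector_db`, the embedding model name, and the choice of loaders are assumptions, and a real pipeline would likely also split documents into chunks before embedding. The loaders additionally need the `pypdf` and `docx2txt` packages, and `HuggingFaceEmbeddings` needs `sentence-transformers`.

```python
from pathlib import Path

INDEX_DIR = "vector_db"  # hypothetical location of the saved FAISS index
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed embedding model


def index_exists(index_dir=INDEX_DIR):
    """FAISS.save_local writes index.faiss and index.pkl by default."""
    folder = Path(index_dir)
    return (folder / "index.faiss").is_file() and (folder / "index.pkl").is_file()


def load_documents(data_dir="data"):
    """Load every PDF and DOCX file in data_dir as LangChain documents."""
    # Imported lazily so index_exists() works without LangChain installed.
    from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader

    loaders = {".pdf": PyPDFLoader, ".docx": Docx2txtLoader}
    documents = []
    for path in sorted(Path(data_dir).iterdir()):
        loader_cls = loaders.get(path.suffix.lower())
        if loader_cls is not None:
            documents.extend(loader_cls(str(path)).load())
    return documents


def get_vector_db(data_dir="data", index_dir=INDEX_DIR):
    """Load the saved index if present; otherwise ingest and save it."""
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    embeddings = HuggingFaceEmbeddings(model_name=MODEL_NAME)
    if index_exists(index_dir):
        # Recent langchain_community versions require this flag because the
        # index metadata is pickled; only load index files you created yourself.
        return FAISS.load_local(
            index_dir, embeddings, allow_dangerous_deserialization=True
        )
    db = FAISS.from_documents(load_documents(data_dir), embeddings)
    db.save_local(index_dir)
    return db
```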

Step 3: Retrieve Relevant Contexts

After the ingestion process, you can retrieve relevant contexts by executing the zindi.py script. This script will generate relevant contexts based on your query and store them in a CSV file inside the test_files folder.

Run the following command to execute the context retrieval:

python zindi.py

This script:

  • Takes a query as input.
  • Retrieves the top 5 relevant contexts from the vector database.
  • Stores these contexts in a CSV file inside the test_files folder.
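
A minimal sketch of this retrieve-and-store flow is shown below. It assumes `db` is any object exposing a `similarity_search(query, k=...)` method that returns items with a `page_content` attribute, as LangChain's FAISS vector store does. The column names are hypothetical, and the output path follows the `contexts.csv` entry in the directory listing.

```python
import csv
from pathlib import Path


def retrieve_contexts(db, query, k=5):
    """Return the text of the k most similar chunks for the query."""
    return [doc.page_content for doc in db.similarity_search(query, k=k)]


def save_contexts(query, contexts, csv_path="test_files/contexts.csv"):
    """Write one CSV row per retrieved context and return the file path."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Hypothetical column layout; the real script may use different headers.
        writer.writerow(["query", "rank", "context"])
        for rank, context in enumerate(contexts, start=1):
            writer.writerow([query, rank, context])
    return path
```

With a loaded vector database, `save_contexts(query, retrieve_contexts(db, query))` would write the five retrieved contexts for that query to the CSV file.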