MIRAGE is a retrieval-augmented generation (RAG) system that draws on data from sources such as Wikipedia and Bing. It uses spaCy/NLTK to split the gathered text into sentences, which are then encoded with a selected sentence transformer. The resulting sentence embeddings are indexed with autofaiss, forming the backbone of the retrieval system. For any given query, MIRAGE retrieves the most relevant sentence embeddings and also returns a specified number of sentences preceding and following each matched sentence, which adds context for the reader and can improve the accuracy of responses.
Begin by ensuring Python is installed on your system, then install the necessary dependencies with the following command:
pip install -r requirements.txt
To initiate the data collection process:
- Open src/config.txt and modify wiki_terms and bing_terms with the terms you wish to search for.
- Run the following command within the src directory to start downloading data:
  python download_data.py
This command will download the data into a directory named rag_database.
After data collection is complete, follow these steps to build the index:
- Inside the src directory, execute:
  python build_index.py
This will construct the sentence embeddings index and store it in the index_folder, which you can specify in config.txt.
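To illustrate what the index-building step produces, here is a minimal, dependency-free sketch of the split, encode, and index pipeline. It is not MIRAGE's code: the regex splitter stands in for spaCy/NLTK, the bag-of-words vectors stand in for sentence-transformer embeddings, and the brute-force cosine search stands in for an autofaiss index. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive regex splitter (assumption: stand-in for spaCy/NLTK segmentation)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    """Toy bag-of-words vector (assumption: stand-in for a sentence transformer)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_toy_index(text):
    """Split text into sentences and embed each one.

    The returned list of vectors plays the role of the index here;
    position i in the list corresponds to sentence i.
    """
    sentences = split_sentences(text)
    return sentences, [embed(s) for s in sentences]


def nearest(query, vectors):
    """Return the position of the vector most similar to the query (brute force)."""
    q = embed(query)
    scores = [cosine(q, v) for v in vectors]
    return max(range(len(vectors)), key=scores.__getitem__)
```

For example, after calling build_toy_index on a short text, nearest(query, vectors) returns the position of the best-matching sentence, which can then be looked up in the sentences list. A real approximate-nearest-neighbour index answers the same question without comparing the query against every vector.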
To conduct searches and retrieve relevant sentences:
- Insert your query into the query section of config.txt.
- From the src directory, run:
  python search_with_index.py
This operation retrieves the most closely matching sentence embeddings for your query and provides N sentences before and after the identified sentence, offering a richer context.
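The context window described above can be sketched as a slice around the matched position. This is an illustration only, not MIRAGE's implementation: the function name with_context and its parameters are hypothetical, and it assumes the sentences of a document are kept in their original order in a single list.

```python
def with_context(sentences, match_index, n):
    """Return the matched sentence plus up to n sentences on each side.

    The window is clipped at the start and end of the list, so a match
    near a boundary simply returns fewer neighbours.
    """
    if not 0 <= match_index < len(sentences):
        raise IndexError("match_index is out of range")
    if n < 0:
        raise ValueError("n must be non-negative")
    start = max(0, match_index - n)
    end = min(len(sentences), match_index + n + 1)
    return sentences[start:end]
```

With sentences = ["A.", "B.", "C.", "D.", "E."], calling with_context(sentences, 2, 1) returns ["B.", "C.", "D."], while with_context(sentences, 0, 2) returns ["A.", "B.", "C."] because there is nothing before the first sentence.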
To re-rank the matches returned by the search operation, using a cross-encoder that scores each retrieved sentence against the query:
- Open src/config.txt and set cross_encoder_rerank to True.
- From the src directory, run:
  python search_with_index.py
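As a rough sketch of what re-ranking does, the function below scores every (query, candidate) pair and sorts the candidates by that score. The names rerank, score_pairs, and overlap_scorer are hypothetical, and the word-overlap scorer is only a toy stand-in for a cross-encoder. Which cross-encoder model MIRAGE uses is not stated here, so the scorer is left pluggable.

```python
def rerank(query, candidates, score_pairs):
    """Sort candidates by descending relevance to the query.

    score_pairs is any callable that takes a list of (query, candidate)
    tuples and returns one score per pair, which is the shape of input a
    cross-encoder scores.
    """
    pairs = [(query, candidate) for candidate in candidates]
    scores = score_pairs(pairs)
    return sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)


def overlap_scorer(pairs):
    """Toy scorer (assumption): fraction of query words found in the candidate."""
    scores = []
    for query, candidate in pairs:
        query_words = set(query.lower().split())
        candidate_words = set(candidate.lower().split())
        if not query_words:
            scores.append(0.0)
        else:
            scores.append(len(query_words & candidate_words) / len(query_words))
    return scores
```

Unlike the embedding search, which encodes the query and each sentence separately, a cross-encoder reads the query and a candidate together. That is usually slower per pair, which is why it is applied only to the short list of matches that the index has already retrieved.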
MIRAGE is continually evolving, with plans to incorporate hypothetical document embeddings and step-back prompting to refine the retrieval process. The long-term vision involves experimenting with multimodal embeddings, potentially integrating with Meta's ImageBind project or the LLaVA project, to explore advanced reasoning and captioning capabilities across various media modalities.
We welcome contributions to MIRAGE, whether it's in the form of feature additions, bug fixes, or documentation enhancements. Feel free to fork the repository, make your changes, and submit a pull request.
MIRAGE is made available under the MIT License. You are authorized to use, modify, and distribute it subject to the terms of this license.
Your participation and feedback can help make MIRAGE an even more powerful tool for data retrieval and analysis. For any questions, suggestions, or contributions, please don't hesitate to reach out or submit a pull request. Let's advance the capabilities of retrieval augmented generation systems together with MIRAGE!