This repository contains an implementation of a Retrieval-Augmented Generation (RAG) system for question answering, built on the CRAG dataset and the Llama 3 8B Instruct model.
The RAG system combines a retrieval component and a language model to generate accurate and contextually relevant answers to questions. The retrieval component searches a corpus of documents to find relevant passages related to the input question. These retrieved passages are then used to condition the language model during the generation process, providing the necessary context to produce a well-informed response.
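The retrieve-then-generate flow described above can be sketched in a few lines. This is only an illustration, not the code from the notebooks: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, the linear scan stands in for a vector index, the sample passages and the prompt wording are made up for the example, and the final call to the language model is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question (brute-force scan)."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Place the retrieved passages in front of the question as context."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical mini-corpus, for illustration only.
passages = [
    "The 2010 film Inception was directed by Christopher Nolan.",
    "The S&P 500 is a stock market index of large US companies.",
    "Abbey Road is an album by the Beatles released in 1969.",
]
question = "Who directed the film Inception?"

top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string would then be passed to the language model
```

In a full pipeline the toy pieces would be replaced by a learned embedding model and an approximate nearest-neighbour index, but the shape of the computation stays the same: embed the question, rank passages, then condition generation on the top results.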
The RAG system is built on the CRAG (Comprehensive RAG) dataset. This dataset includes question-answer pairs spanning five domains: Finance, Sports, Music, Movies, and Open Encyclopedia. These domains cover a range of rates at which information changes, from rapid (Finance and Sports) to gradual (Music and Movies) to stable (Open Encyclopedia).
The project consists of the following main components:
- RAG Pipeline: The core implementation of the Retrieval-Augmented Generation system is contained in the `RAG_pipeline_final.ipynb` Jupyter Notebook. This notebook includes the retrieval and generation stages, using Llama 3, Faiss, and SentenceTransformers for efficient retrieval and context-aware generation.
- Data Cleaning Pipeline: The `data_cleaning_pipeline.ipynb` Jupyter Notebook handles the data cleaning and preprocessing steps required for the CRAG dataset.
To run this project locally, follow these steps:
- Clone the repository:
  `git clone https://github.com/RavicharanN/CRAG-Benchmark.git`