This repository provides a guide to building a question-answering system using multimodal retrieval augmented generation (RAG). The system leverages Google's Gemini models, the Vertex AI API, and text embeddings to perform Q&A over documents containing both text and images.
Retrieval augmented generation (RAG) enhances the capabilities of large language models (LLMs) by providing access to external data, thus improving their knowledge base and mitigating hallucinations. This notebook demonstrates how to implement a multimodal RAG system to perform Q&A over a document filled with both text and images.
This notebook guides you through building the system with the Vertex AI Gemini API and text embeddings: extracting data from documents, storing it in a vector store, searching the store with text queries, and generating answers with the Gemini Pro model.
- Extract data from documents containing both text and images using Gemini Pro Vision.
- Generate embeddings of the data and store them in a vector store.
- Search the vector store with text queries to find relevant data.
- Generate answers to user queries using the Gemini Pro model.
To get started, you need a Google Cloud project with the Vertex AI API enabled. This section will guide you through the necessary setup and installation steps.
- **Install Required Packages**:
  - Install packages such as `pymupdf`, `langchain`, `gradio`, `google-cloud-aiplatform`, and `langchain_google_vertexai`, as shown below.
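
  In a notebook, the installation typically looks like the following (versions are deliberately not pinned here):

  ```python
  # %pip installs into the active kernel environment on Colab and Workbench.
  %pip install --upgrade --quiet pymupdf langchain gradio google-cloud-aiplatform langchain_google_vertexai
  ```
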
- **Restart Runtime**:
  - After installing the packages, restart the runtime so the newly installed packages are loaded correctly; see the snippet below.
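
  A common way to do this programmatically (the pattern used in Google's sample notebooks) is to shut down the kernel and let the frontend restart it:

  ```python
  import IPython

  # Shut down the kernel; the notebook frontend restarts it automatically.
  app = IPython.Application.instance()
  app.kernel.do_shutdown(True)
  ```
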
- **Google Colab Authentication**:
  - If running on Google Colab, authenticate your environment using Colab's authentication helper (see below).
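
  A minimal sketch that authenticates only when the code is running on Colab:

  ```python
  import sys

  # Vertex AI Workbench instances are already authenticated, so guard on Colab.
  if "google.colab" in sys.modules:
      from google.colab import auth

      auth.authenticate_user()
  ```
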
- **Vertex AI Workbench**:
  - If using Vertex AI Workbench, skip the authentication step; it is not required there.
- **Define Project Information**:
  - Specify your Google Cloud project ID and location.
- **Initialize Vertex AI**:
  - Initialize the Vertex AI SDK with the specified project information, as in the sketch below.
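
  Both steps together, with placeholder values to replace:

  ```python
  import vertexai

  PROJECT_ID = "your-project-id"  # replace with your Google Cloud project ID
  LOCATION = "us-central1"        # replace with a region where Vertex AI is available

  vertexai.init(project=PROJECT_ID, location=LOCATION)
  ```
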
- **Download Sample PDF and Images**:
  - Download a sample PDF file and a default image to display when no results are found (a sketch follows).
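
  A minimal sketch; the URLs below are placeholders, not the actual assets used by the notebook:

  ```python
  import urllib.request

  # Placeholder URLs -- substitute the real sample PDF and fallback image locations.
  urllib.request.urlretrieve("https://example.com/sample.pdf", "sample.pdf")
  urllib.request.urlretrieve("https://example.com/no_result.png", "no_result.png")
  ```
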
- **Process PDF to Images**:
  - Split the PDF into images by rendering each page with the `fitz` (PyMuPDF) library, as shown below.
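
  Rendering each page to a PNG with PyMuPDF:

  ```python
  import fitz  # PyMuPDF

  doc = fitz.open("sample.pdf")
  for page_number, page in enumerate(doc):
      pix = page.get_pixmap(dpi=150)  # render the page as a raster image
      pix.save(f"page_{page_number}.png")
  ```
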
- **Extract Data Using Gemini Pro Vision**:
  - Load each page image and use the Gemini Pro Vision model to extract text and tabular data from it; see the sketch below.
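
  A sketch of the extraction step; the prompt wording here is an assumption, not the notebook's exact prompt:

  ```python
  from vertexai.generative_models import GenerativeModel, Image

  vision_model = GenerativeModel("gemini-pro-vision")

  def extract_page_text(image_path: str) -> str:
      """Ask Gemini Pro Vision to transcribe the text and tables on one page image."""
      image = Image.load_from_file(image_path)
      response = vision_model.generate_content(
          [image, "Extract all text and tabular data from this page."]
      )
      return response.text

  extracted_texts = [extract_page_text(f"page_{i}.png") for i in range(doc.page_count)]
  ```
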
- **Store Extracted Information**:
  - Store the extracted information in BigQuery for further processing (sketched below).
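
  A minimal sketch using the BigQuery client; the dataset and table names are hypothetical:

  ```python
  from google.cloud import bigquery

  bq_client = bigquery.Client(project=PROJECT_ID)
  TABLE_ID = f"{PROJECT_ID}.multimodal_rag.extracted_pages"  # hypothetical dataset.table

  rows = [
      {"page": i, "text": text, "image_path": f"page_{i}.png"}
      for i, text in enumerate(extracted_texts)
  ]
  errors = bq_client.insert_rows_json(TABLE_ID, rows)  # streaming insert
  if errors:
      raise RuntimeError(errors)
  ```
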
- **Initialize Text Embedding Model**:
  - Use the `textembedding-gecko` model to generate embeddings for the extracted text data.
- **Generate Text Embeddings**:
  - Create a function to generate text embeddings and apply it to the extracted text to produce a list of embedding vectors, as shown below.
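
  Both steps together; `textembedding-gecko` returns 768-dimensional vectors:

  ```python
  from vertexai.language_models import TextEmbeddingModel

  embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

  def get_text_embedding(text: str) -> list[float]:
      """Return the 768-dimensional embedding vector for one piece of text."""
      return embedding_model.get_embeddings([text])[0].values

  page_embeddings = [get_text_embedding(text) for text in extracted_texts]
  ```
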
- **Store Embeddings**:
  - Store the generated embeddings in BigQuery along with the corresponding text and image references.
- **Create Vector Search Index**:
  - Define parameters for the vector search index, including the number of dimensions and the distance measure type.
- **Save Embeddings to JSONL**:
  - Save the embeddings in JSONL format and upload the file to a Google Cloud Storage bucket (see below).
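
  Writing the JSONL file (one record per line, with an `id` and its `embedding` vector, the format Vertex AI Vector Search expects):

  ```python
  import json

  # One JSON record per line: an id plus its embedding vector.
  with open("embeddings.json", "w") as f:
      for i, vector in enumerate(page_embeddings):
          f.write(json.dumps({"id": str(i), "embedding": vector}) + "\n")
  ```

  The file is then copied to a staging bucket, e.g. `gsutil cp embeddings.json gs://YOUR_BUCKET/embeddings/` (the bucket name is a placeholder).
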
- **Create and Deploy Index**:
  - Create a vector search index with Vertex AI Matching Engine and deploy it to an index endpoint, as sketched below.
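
  A sketch using the `google-cloud-aiplatform` SDK; the display names, bucket URI, and deployed index ID are placeholders, and index creation can take tens of minutes:

  ```python
  from google.cloud import aiplatform

  index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
      display_name="multimodal-rag-index",
      contents_delta_uri="gs://YOUR_BUCKET/embeddings",  # folder holding embeddings.json
      dimensions=768,                                    # textembedding-gecko output size
      approximate_neighbors_count=10,
      distance_measure_type="DOT_PRODUCT_DISTANCE",
  )

  endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
      display_name="multimodal-rag-endpoint",
      public_endpoint_enabled=True,
  )
  endpoint.deploy_index(index=index, deployed_index_id="multimodal_rag_deployed")
  ```
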
- **Generate Query Embeddings**:
  - Create a function to generate embeddings for the user's query.
- **Find Relevant Documents**:
  - Query the vector search endpoint with the query embedding to find the most relevant documents, as shown below.
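
  Reusing `get_text_embedding` from above; the query text is illustrative:

  ```python
  query = "What does the document say about quarterly revenue?"  # example query
  query_embedding = get_text_embedding(query)

  matches = endpoint.find_neighbors(
      deployed_index_id="multimodal_rag_deployed",
      queries=[query_embedding],
      num_neighbors=5,
  )[0]
  for neighbor in matches:
      print(neighbor.id, neighbor.distance)
  ```
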
- **Generate Answer**:
  - Create a function that passes the relevant documents as context to the Gemini Pro model and generates an answer (see the sketch below).
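
  A minimal grounded-answer sketch; the prompt wording is an assumption:

  ```python
  from vertexai.generative_models import GenerativeModel

  qa_model = GenerativeModel("gemini-pro")

  def generate_answer(query: str, context: str) -> str:
      prompt = (
          "Answer the question using only the context below. "
          "If the context does not contain the answer, say so.\n\n"
          f"Context:\n{context}\n\nQuestion: {query}"
      )
      return qa_model.generate_content(prompt).text
  ```
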
- **Handle Multiple Attempts**:
  - Implement logic to handle multiple attempts at finding a satisfactory answer, falling back to the next most relevant document if necessary (sketched below).
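
  One way to express the fallback loop; the acceptance check is a simplistic stand-in for the notebook's actual criterion:

  ```python
  def answer_with_retries(query: str, neighbors, max_attempts: int = 3):
      """Try successive neighbors until Gemini produces a usable answer."""
      for neighbor in neighbors[:max_attempts]:
          context = extracted_texts[int(neighbor.id)]  # look up the stored page text
          answer = generate_answer(query, context)
          if "does not contain" not in answer.lower():  # crude acceptance check
              return answer, int(neighbor.id)
      return "No satisfactory answer was found.", None
  ```
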
- **Set Up Gradio Interface**:
  - Use Gradio to create a web-based interface for the question-answering system.
- **Define User Interactions**:
  - Define how user queries are processed and how the resulting answers and images are displayed.
- **Launch Gradio App**:
  - Launch the Gradio app to make the question-answering system accessible via a web interface (a combined sketch follows this list).
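
A combined sketch wiring the earlier pieces into a Gradio app (the names reuse the sketches above):

```python
import gradio as gr

def ask(query: str):
    # Embed the query, retrieve the nearest pages, then generate an answer.
    neighbors = endpoint.find_neighbors(
        deployed_index_id="multimodal_rag_deployed",
        queries=[get_text_embedding(query)],
        num_neighbors=5,
    )[0]
    answer, page = answer_with_retries(query, neighbors)
    image_path = f"page_{page}.png" if page is not None else "no_result.png"
    return answer, image_path

demo = gr.Interface(
    fn=ask,
    inputs=gr.Textbox(label="Ask a question about the document"),
    outputs=[gr.Textbox(label="Answer"), gr.Image(label="Source page")],
    title="Multimodal RAG Q&A",
)
demo.launch()  # pass share=True on Colab to get a public link
```
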
- Vertex AI documentation
- LangChain documentation
- Gradio documentation
This README provides a comprehensive overview and detailed instructions for setting up and running the multimodal RAG-based question-answering system using Google Cloud's Vertex AI and Gemini models. For more details, refer to the provided documentation links.