/youtube-rag

🎥 Building a simple RAG (Retrieval-Augmented Generation) application using Pinecone and OpenAI's API. The application will allow you to ask questions about any YouTube video.

Primary Language: Jupyter Notebook

Building a RAG application from scratch

This is a step-by-step guide to building a simple RAG (Retrieval-Augmented Generation) application using Pinecone and OpenAI's API. The application will allow you to ask questions about any YouTube video.
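
To make the "retrieval" half of RAG concrete, here is a minimal sketch that runs without any API keys. It is not the tutorial's code: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the `chunks` list is a made-up transcript, and `retrieve` plays the role that the vector store plays in the real application.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list, k: int = 1) -> list:
    """Return the k transcript chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]


# Hypothetical transcript chunks; in the real app these come from the video.
chunks = [
    "The speaker explains how transformers use attention.",
    "Later the video covers the cost of training large models.",
    "The host thanks the sponsors of the channel.",
]

top = retrieve("How much does training cost?", chunks, k=1)
print(top[0])
```

In the actual application the retrieved chunks are then passed to the language model as context for answering the question.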

Training Video HERE

Tech Stack

  • OpenAI
  • LangChain
  • openai-whisper
  • scikit-learn
  • langchain-pinecone (Vector Store)
  • Colab

Setup

  1. The tutorial uses an in-memory vector store, which requires an extra installation (pip install "langchain[docarray]") as well as the Microsoft C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/. I did not follow that part; I skipped it and used Pinecone directly instead.
  2. Create a virtual environment and install the required packages:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

For whisper installation, use pip install git+https://github.com/openai/whisper.git instead of pip install whisper.
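
If you are not sure which package ended up in your environment, a quick check like the one below can help. It is only a sketch and rests on one assumption: that the GitHub install registers itself under the distribution name `openai-whisper`. The helper name `openai_whisper_version` is made up for this example.

```python
from importlib import metadata
from typing import Optional


def openai_whisper_version() -> Optional[str]:
    """Return the installed openai-whisper version, or None if it is absent."""
    try:
        # Assumed distribution name for OpenAI's Whisper package.
        return metadata.version("openai-whisper")
    except metadata.PackageNotFoundError:
        return None


version = openai_whisper_version()
if version is None:
    print("openai-whisper is not installed in this environment")
else:
    print(f"openai-whisper {version} is installed")
```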

  1. Create a free Pinecone account and get your API key from here. If you are not offered a choice of region, you are probably on the same one as me (Iowa, US).
    In that case, set PINECONE_API_ENV="us-central1-gcp"

  2. Create a .env file with the following variables:

OPENAI_API_KEY = [ENTER YOUR OPENAI API KEY HERE]
PINECONE_API_KEY = [ENTER YOUR PINECONE API KEY HERE]
PINECONE_API_ENV = [ENTER YOUR PINECONE API ENVIRONMENT HERE]
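
Before running the notebook it can save time to confirm that all three variables are really set and that no placeholder was left in. The sketch below uses only the standard library; the helper name `missing_keys` and the `example_env` values are made up, and it assumes the .env file has already been loaded into the environment (for instance with python-dotenv's `load_dotenv()`).

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_API_ENV")


def missing_keys(env: Mapping[str, str]) -> list:
    """Return required keys that are unset, empty, or still a [PLACEHOLDER]."""
    missing = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "").strip()
        if not value or value.startswith("["):
            missing.append(key)
    return missing


# Example with a hypothetical, partly filled-in environment.
example_env = {
    "OPENAI_API_KEY": "sk-example",
    "PINECONE_API_KEY": "[ENTER YOUR PINECONE API KEY HERE]",
}
print(missing_keys(example_env))  # ['PINECONE_API_KEY', 'PINECONE_API_ENV']

# In the notebook, check the real environment instead.
problems = missing_keys(os.environ)
if problems:
    print("Still missing:", ", ".join(problems))
```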
  1. Bug fix: I reported this issue on the author's GitHub, HERE. Using PineconeVectorStore gave me an unauthorized error, so I use Pinecone directly instead.
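import os

from langchain_pinecone import Pinecone

# Placeholder: replace the string with your actual Pinecone API key
os.environ["PINECONE_API_KEY"] = "PINECONE_API_KEY"
index_name = "youtube-index"

# `documents` and `embeddings` are assumed to be defined earlier in the notebook
# (the split transcript and the embedding model)
pinecone = Pinecone.from_documents(
    documents=documents,
    embedding=embeddings,
    index_name=index_name,
)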
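
Once the index is populated, the store can be queried for the chunks closest to a question. The sketch below shows one way to turn those chunks into a prompt. It relies only on the `similarity_search(query, k=...)` method and the `page_content` attribute that LangChain vector stores and documents expose. The prompt wording, the `build_prompt` helper, and the `FakeStore`/`FakeDoc` stand-ins are my own assumptions, added so the example runs without a Pinecone account.

```python
from typing import Any


def build_prompt(store: Any, question: str, k: int = 3) -> str:
    """Fetch the k most similar chunks from the store and wrap them in a prompt."""
    docs = store.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


class FakeDoc:
    """Stand-in for a LangChain document: only carries page_content."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


class FakeStore:
    """Stand-in for the vector store so the sketch runs offline."""

    def __init__(self, texts: list) -> None:
        self.texts = texts

    def similarity_search(self, query: str, k: int = 4) -> list:
        # A real store ranks by embedding similarity; this one just truncates.
        return [FakeDoc(text) for text in self.texts[:k]]


prompt = build_prompt(FakeStore(["chunk one", "chunk two"]), "What is covered?", k=1)
print(prompt)
```

With the real store, the call would be build_prompt(pinecone, "your question"), and the resulting string would be sent to the OpenAI model.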

💖 Conclusion: this is a good tutorial for getting started with LangChain, Whisper, audio transcription, and RAG. I recommend it.