This repo uses OpenAI models with LangChain for question answering over a SQL database, along with a few other retrieval-based question answering tasks.
The movie-dialog-corpus from Cornell is used in this repo. It is preprocessed and converted into a relational SQL database, with relational constraints such as primary keys and foreign keys added.
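As an illustration of what adding primary and foreign key constraints looks like in SQLite, here is a minimal sketch. The table names, column names, and rows are hypothetical and are not taken from the repo's actual schema; the preprocessing notebook remains the reference for the real one.

```python
import sqlite3

# Hypothetical schema: names are illustrative and may not match moviesdb.db.
SCHEMA = """
CREATE TABLE movies (
    movie_id TEXT PRIMARY KEY,
    title    TEXT NOT NULL
);
CREATE TABLE characters (
    character_id TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    movie_id     TEXT NOT NULL REFERENCES movies(movie_id)
);
"""

def build_demo_db():
    """Create an in-memory database with the hypothetical schema."""
    con = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is enabled on the connection.
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript(SCHEMA)
    return con

def orphan_is_rejected(con):
    """Return True if inserting a character for an unknown movie is refused."""
    try:
        con.execute("INSERT INTO characters VALUES ('c1', 'Example Orphan', 'm999')")
    except sqlite3.IntegrityError:
        return True
    return False

con = build_demo_db()
con.execute("INSERT INTO movies VALUES ('m0', 'Example Movie')")
con.execute("INSERT INTO characters VALUES ('c0', 'Example Character', 'm0')")
print(orphan_is_rejected(con))
```

With the constraints in place, the database itself refuses rows that point at a movie that does not exist, which keeps joins across tables reliable.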
- Preprocessing was performed on the initial corpus (movie-dialog-corpus directory), and the result is exported as a SQLite database in the database directory. The notebook for preprocessing and conversion, preprocess_and_convert_to_sqlite_db.ipynb, is in the notebooks directory. (This notebook can be skipped, since the SQLite db file is included in the repo and can be used directly in the next notebook.)
- Question answering is done using OpenAI's davinci model along with LangChain; the approach is described in the notebook question_answering_on_sql_database.ipynb. This notebook uses the SQLite database created in the previous notebook.
- To fetch/scrape data from the movie script URLs and create FAISS-based vector database indexes using OpenAI Embeddings, refer to the notebook fetch_movie_scripts_and_create_indexes.ipynb. This notebook can be skipped, and the generated files can be downloaded from the Google Drive folder.
- To query OpenAI after retrieving the relevant entries from the vector indexes created in the previous notebook, use the notebook querying_from_openai_after_retrieval_from_indexes
- To use agents that join multiple tools together into a basic question answering system, combining the SQL database module, vector-based querying on movie scripts, and additional tools, go through the notebook using_agents_for_qa_on_sql_and_vectordb.ipynb
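The retrieval step used by the vector-index notebooks amounts to a nearest-neighbour search over embedding vectors. The sketch below shows that idea in plain Python under loud assumptions: toy 3-dimensional vectors stand in for OpenAI embeddings, a linear scan with cosine similarity stands in for FAISS, and the chunk texts are made up. It does not reproduce the notebooks' code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical script chunks with made-up embedding vectors.
toy_index = [
    ("chunk about a heist", [0.9, 0.1, 0.0]),
    ("chunk about a wedding", [0.0, 0.8, 0.2]),
    ("chunk about a car chase", [0.7, 0.0, 0.3]),
]
toy_query = [1.0, 0.0, 0.1]  # made-up embedding of the user's question

retrieved = top_k(toy_query, toy_index, k=2)
print(retrieved)
```

The retrieved chunks would then be passed to the language model as context alongside the question.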
The SQLite database file moviesdb.db can be viewed with a SQLite viewer such as https://sqlitebrowser.org/
To view the database in Python, run the code below in the same directory as the database:
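import sqlite3
import pandas as pd
# Open the SQLite database file (expected in the current working directory)
con = sqlite3.connect("moviesdb.db")
# Load the movie_titles table into a DataFrame and show the first rows
df = pd.read_sql_query("SELECT * FROM movie_titles", con)
print(df.head())
con.close()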
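To find out which tables and columns are available before writing queries, SQLite's own catalog can be inspected with the standard library alone. The helper names below are illustrative, and the demo runs against an in-memory database whose movie_titles columns are assumed for the example; for the real data, pass a connection to moviesdb.db instead.

```python
import sqlite3

def list_tables(con):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def table_columns(con, table):
    """Return the column names of a table, in declaration order."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]

# Demo on an in-memory database; the columns here are assumptions, not the repo's schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movie_titles (movie_id TEXT PRIMARY KEY, title TEXT)")

print(list_tables(demo))
print(table_columns(demo, "movie_titles"))
```

For the actual database, the same helpers can be called as `list_tables(sqlite3.connect("moviesdb.db"))`.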