Text-to-SQL Copilot is a tool for users who see SQL databases as a barrier to actionable insights. It takes a natural language question as input, uses a generative text model to write a SQL statement based on your data model, runs the query against your database, and analyses the results. It does all of this at no cost using the HuggingFace Inference API.
This project was built on the Spider dataset. Follow these steps to recreate it:
- Download the data from this Google Drive
- Unzip the file
- Save the root 'spider' folder under the src/data/raw/ directory
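If you want to sanity-check the layout before running the setup step, a quick sketch like the following can confirm the data is where it is expected (the database/ folder and tables.json ship with the Spider release; which files setup.py actually reads is an assumption):

```python
# check_spider_layout.py -- optional sanity check, not part of the repo.
# Assumes the Spider release was unzipped to src/data/raw/spider/.
from pathlib import Path

SPIDER_ROOT = Path("src/data/raw/spider")

def main() -> None:
    expected = [
        SPIDER_ROOT / "database",      # one sub-folder per SQLite database
        SPIDER_ROOT / "tables.json",   # schema metadata shipped with Spider
    ]
    for path in expected:
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:>8}  {path}")

if __name__ == "__main__":
    main()
```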
The application pulls schema information from the SQLite database files and uses a locally stored Chroma vector database to identify which schema should be used to answer a given question. Run the following commands to compile the database info and build the vector database:
pip3 install -r requirements.txt
python3 setup.py
This will take about 10-15 minutes to fully run.
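Conceptually, this step reads the CREATE TABLE statements out of each SQLite file and indexes them in Chroma so the most relevant schema can be retrieved per question. The snippet below is a minimal sketch of that idea using the chromadb client directly; the persistence path and collection name are illustrative assumptions, not the repo's actual values:

```python
# Minimal sketch of the schema-indexing step (illustrative only).
# Paths, collection name and embedding defaults are assumptions.
import sqlite3
from pathlib import Path

import chromadb

SPIDER_DBS = Path("src/data/raw/spider/database")
client = chromadb.PersistentClient(path="src/data/processed/chroma")
collection = client.get_or_create_collection("spider_schemas")

for db_file in SPIDER_DBS.glob("*/*.sqlite"):
    with sqlite3.connect(db_file) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL"
        ).fetchall()
    schema_text = "\n".join(sql for (sql,) in rows)
    # One document per database; Chroma embeds it with its default model.
    collection.add(ids=[db_file.parent.name], documents=[schema_text])
```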
Currently, this project relies on the google/flan-t5-xxl language model, accessed for free through the HuggingFace Inference API. To use it, create a HuggingFace API token and save it in a .env file in the root of the repo:
touch .env
Open the .env file and enter your HuggingFace API token:
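For example (the variable name below is the one LangChain's HuggingFaceHub wrapper reads by default; check main.py if the app expects a different name):

HUGGINGFACEHUB_API_TOKEN=<your-token-here>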
Navigate to the src/app directory and start the program with the following command:
python3 main.py
Then input your question - happy SQL-ing!
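Under the hood, the question is answered by retrieving the most relevant schema from Chroma and prompting the model through LangChain's HuggingFaceHub wrapper. The following is a minimal sketch of that wiring, not the exact chain in main.py; the prompt wording, example schema, and model parameters are assumptions:

```python
# Illustrative sketch of the question -> SQL step (not the repo's main.py).
# Prompt text, temperature and max_length are assumptions.
from dotenv import load_dotenv
from langchain import HuggingFaceHub, LLMChain, PromptTemplate

load_dotenv()  # makes HUGGINGFACEHUB_API_TOKEN from .env visible to LangChain

llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.1, "max_length": 256},
)

prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=(
        "Given the following SQL tables:\n{schema}\n\n"
        "Write a SQL query that answers: {question}\nSQL:"
    ),
)

chain = LLMChain(llm=llm, prompt=prompt)
sql = chain.run(schema="CREATE TABLE singer (id INT, name TEXT, age INT)",
                question="How many singers are older than 30?")
print(sql)
```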
Chase, H. (2022). LangChain [Computer software]. https://github.com/hwchase17/langchain
Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., ... & Radev, D. (2018). Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.