Text-to-SQL Copilot is a tool for users who see SQL databases as a barrier to actionable insights. It takes a natural language question as input, uses a generative text model to write a SQL statement based on your data model, runs the query against your database, and analyses the results. It does all of this at no cost using the HuggingFace Inference API.
This project was built on the Spider dataset. Follow these steps to recreate it:
- Download the data from this Google Drive
- Unzip the file
- Save the root 'spider' folder under the src/data/raw/ directory
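If you want to sanity-check the layout before running the setup step, a quick sketch like the following can confirm the data is where it is expected (the database/ folder and tables.json ship with the Spider release; which files setup.py actually reads is an assumption):

```python
# check_spider_layout.py -- optional sanity check, not part of the repo.
# Assumes the Spider release was unzipped to src/data/raw/spider/.
from pathlib import Path

SPIDER_ROOT = Path("src/data/raw/spider")

def main() -> None:
    expected = [
        SPIDER_ROOT / "database",      # one sub-folder per SQLite database
        SPIDER_ROOT / "tables.json",   # schema metadata shipped with Spider
    ]
    for path in expected:
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:>8}  {path}")

if __name__ == "__main__":
    main()
```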
The application pulls schema information from the SQLite database files and uses a locally stored Chroma vector database to identify which schema should be used to answer a given question. Run the following commands to compile the database info and build the vector database:
pip3 install -r requirements.txt
python3 setup.py
This will take about 10-15 minutes to fully run.
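Conceptually, this step reads the CREATE TABLE statements out of each SQLite file and indexes them in Chroma so the most relevant schema can be retrieved per question. The snippet below is a minimal sketch of that idea using the chromadb client directly; the persistence path and collection name are illustrative assumptions, not the repo's actual values:

```python
# Minimal sketch of the schema-indexing step (illustrative only).
# Paths, collection name and embedding defaults are assumptions.
import sqlite3
from pathlib import Path

import chromadb

SPIDER_DBS = Path("src/data/raw/spider/database")
client = chromadb.PersistentClient(path="src/data/processed/chroma")
collection = client.get_or_create_collection("spider_schemas")

for db_file in SPIDER_DBS.glob("*/*.sqlite"):
    with sqlite3.connect(db_file) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL"
        ).fetchall()
    schema_text = "\n".join(sql for (sql,) in rows)
    # One document per database; Chroma embeds it with its default model.
    collection.add(ids=[db_file.parent.name], documents=[schema_text])
```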
Currently, this project relies on the google/flan-t5-xxl language model, accessed for free through the HuggingFace Inference API. To use it, create a HuggingFace API token and save it in a .env file in the root of the repo:
touch .env
Open the .env file and enter your HuggingFace API token:
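For example (the variable name below is the one LangChain's HuggingFaceHub wrapper reads by default; check main.py if the app expects a different name):

HUGGINGFACEHUB_API_TOKEN=<your-token-here>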
Navigate to the src/app directory and start the program with the following command:
python3 main.py
Then input your question - happy SQL-ing!
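Under the hood, the question is answered by retrieving the most relevant schema from Chroma and prompting the model through LangChain's HuggingFaceHub wrapper. The following is a minimal sketch of that wiring, not the exact chain in main.py; the prompt wording, example schema, and model parameters are assumptions:

```python
# Illustrative sketch of the question -> SQL step (not the repo's main.py).
# Prompt text, temperature and max_length are assumptions.
from dotenv import load_dotenv
from langchain import HuggingFaceHub, LLMChain, PromptTemplate

load_dotenv()  # makes HUGGINGFACEHUB_API_TOKEN from .env visible to LangChain

llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.1, "max_length": 256},
)

prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=(
        "Given the following SQL tables:\n{schema}\n\n"
        "Write a SQL query that answers: {question}\nSQL:"
    ),
)

chain = LLMChain(llm=llm, prompt=prompt)
sql = chain.run(schema="CREATE TABLE singer (id INT, name TEXT, age INT)",
                question="How many singers are older than 30?")
print(sql)
```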
Chase, H. (2022). LangChain [Computer software]. https://github.com/hwchase17/langchain
Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., ... & Radev, D. (2018). Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.