
The State of the Text-to-SQL LLM models: Insights From TPC-DS Benchmark

This repository contains the artifacts of the research work titled "The State of the Text-to-SQL LLM models: Insights From TPC-DS Benchmarks". The work was done as part of the Fall 2023 UIUC CS598 course project taught by Prof. Daniel Kang.

Abstract

In the evolving data analytics domain, the integration of Large Language Models (LLMs) is transforming SQL-based data interaction and retrieval. This research introduces a comprehensive framework for evaluating Text-to-SQL LLMs using the industry-standard Transaction Processing Performance Council Decision Support (TPC-DS) specification, which effectively simulates a data warehouse environment. We meticulously transform the TPC-DS benchmark queries into natural language queries (NLQs) to form a unique dataset for Text-to-SQL model input. Recognizing the inadequacy of conventional complexity metrics, especially for nested and intricate subqueries, we introduce new SQL query complexity rules tailored for practical industrial and enterprise applications. This end-to-end framework is designed to evaluate LLMs' effectiveness in SQL query generation and their accuracy on actual databases. Despite recent advances, our findings reveal significant challenges in applying Text-to-SQL models to complex, real-world scenarios, including binding errors, parsing difficulties, and semantic disparities in LLM-generated queries. While queries generated by GPT-4 and SQLCoder-34B show response times up to 13.1× and 481.2× faster, respectively, than handcrafted queries, end-to-end execution is significantly slowed (up to 22.7× and 281.96× for GPT-4 and SQLCoder-34B, respectively) by the LLMs' inference time, underscoring the need for further optimization of LLM generation in practical settings.

Quick Start

To start using our work, install the packages in the requirements file, download the required datasets, and then work with the respective set of files. Setting up the environment is as simple as:

# Pull the NVIDIA Docker container and choose a host directory (<host location> below) where you want to persist your data.
docker pull nvcr.io/nvidia/pytorch:23.09-py3
docker run -itd --gpus all --privileged --shm-size=16g -p 8888:8888 -v <host location>:/workspace/data --name tpcdsinsights nvcr.io/nvidia/pytorch:23.09-py3 bash
docker exec -it tpcdsinsights bash 
cd /workspace/data/
git clone https://github.com/Bhagyashreet20/cs598-tpcds.git
cd cs598-tpcds

If you want to use the TPC-DS dataset in your experiments, you can find it in the data/dataset folder. Note that port 8888 is mapped for Jupyter Notebook; if you plan to use it, we highly recommend starting the Jupyter Notebook server inside the container with the following command:

jupyter notebook --ip 0.0.0.0 --no-browser --allow-root

Minimum hardware and software requirements

Minimum hardware requirements:

  • 16-core x86 CPU
  • 256 GB RAM
  • NVIDIA Ampere-generation or newer GPU with at least 24 GB of memory
  • 512 GB of disk space

Software requirements:

  1. The codebase has been tested with the NVIDIA NGC PyTorch 23.09 container.
  2. Install DuckDB by following the steps in the DuckDB Install guide.
  3. OpenAI API keys are required to access their models. Since we use GPT-3.5 and GPT-4 in our evaluation, you must register with OpenAI and have at least $10 of available credits to start.
  4. Install the additional Python libraries listed in requirement.txt (a minimal install sketch follows this list).

Detailed steps

Due to licensing requirements, we cannot publicly provide the TPC-DS source code used to generate the queries or the dataset. You can request these tools from the TPC-DS committee.
Download the TPC-DS toolkit (dsdgen/dsqgen) after accepting its licensing terms. The toolkit has known build errors, and we are open to providing a patch on request. Note: if you are a core developer of the project, you can raise a request by emailing the primary author.

We provide the generated golden queries only for the SF100 dataset, in data/dataset/SQL. We expect you to generate the SF100 dataset yourself using the following command (this should take close to 1 hour on a Gen4 SSD):

python data/duckdb/gen_sf_data.py

On successful generation, you should see the dataset stored in Parquet format in data/tpcds/tpcds_sf100. If generation fails, you most likely ran out of memory or disk space, or you are using an older version of DuckDB.
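To sanity-check the output, you can query the Parquet files directly from DuckDB. A minimal sketch, assuming per-table Parquet files under data/tpcds/tpcds_sf100 (the exact file layout may differ):

import duckdb

con = duckdb.connect()
# Count rows in one generated table; the glob path is an assumption.
print(con.sql("SELECT COUNT(*) FROM 'data/tpcds/tpcds_sf100/store_sales*.parquet'").fetchone())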

End-to-end evaluation

Executing Golden queries

Executing the golden queries is quite easy; just run the script below.

python data/sql/scripts/execute_golden_queries.py
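Conceptually, the script loads each golden query and times it against the SF100 data in DuckDB. A minimal sketch of that loop, assuming the queries live in data/dataset/SQL and the tables are already visible to DuckDB (the actual script may register views or differ in details):

import glob
import time

import duckdb

con = duckdb.connect()
for path in sorted(glob.glob("data/dataset/SQL/*.sql")):
    sql = open(path).read()
    start = time.perf_counter()
    try:
        con.sql(sql).fetchall()  # materialize results to measure full execution
        print(f"{path}: {time.perf_counter() - start:.2f}s")
    except Exception as exc:
        print(f"{path}: failed ({exc})")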

Executing Text-to-SQL LLM queries

For ease of use and to reduce cost, we split the pipeline into two separate scripts: one that generates SQL with the LLM, and one that executes the generated SQL in the DuckDB database engine. There are also separate scripts per model family, because OpenAI GPT-3.5 and GPT-4 must be accessed via an API endpoint, while SQLCoder-34B and SQLize are locally hosted models. A minimal sketch of the generation call appears after the two lists below.

For GPT-3.5 models:

  1. Generate SQL using data/GPT/gpt-3.5/semi-zero-shot/scripts/gpt-query-gen.py for semi-zero-shot prompting and data/GPT/gpt-3.5/schema-induced/scripts/gpt-query-gen.py for schema-induced prompting.
  2. Execute the generated SQL on DuckDB using data/GPT/gpt-3.5/semi-zero-shot/scripts/zeroshot-execution.py for semi-zero-shot prompting and data/GPT/gpt-3.5/semi-zero-shot/scripts/schema-induced-execution.py for schema-induced prompting.

For GPT-4 models:

  1. Generate SQL using data/GPT/gpt-4/semi-zero-shot/scripts/gpt-query-gen.py for semi-zero-shot prompting and data/GPT/gpt-4/schema-induced/scripts/gpt-query-gen.py for schema-induced prompting.
  2. Execute the generated SQL on DuckDB using data/GPT/gpt-4/semi-zero-shot/scripts/zeroshot-execution.py for semi-zero-shot prompting and data/GPT/gpt-4/semi-zero-shot/scripts/schema-induced-execution.py for schema-induced prompting.
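At its core, each generation script sends the NLQ (plus the table schema, for schema-induced prompting) to the OpenAI endpoint and records the returned SQL. A minimal sketch of that call, where the model name, prompt wording, and schema/NLQ values are illustrative assumptions rather than the scripts' actual configuration:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = open("tpcds_schema.sql").read()  # hypothetical schema dump
nlq = "What is the total store sales revenue in 1998?"  # example NLQ

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Translate the question into one DuckDB-compatible SQL query."},
        {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {nlq}"},
    ],
)
print(response.choices[0].message.content)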

For the SQLCoder-34B and SQLize models, a fair amount of manual intervention is required to extract the SQL query from the LLM response. Please follow the steps detailed below.
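For illustration, extraction typically amounts to pulling the SQL out of the model's free-form text. A toy helper (not part of the repository), under the assumption that the model either fences its SQL or starts it with SELECT/WITH:

import re

def extract_sql(response: str) -> str:
    # Prefer a fenced ```sql block if the model emitted one.
    fenced = re.search(r"```sql\s*(.*?)```", response, re.DOTALL | re.IGNORECASE)
    if fenced:
        return fenced.group(1).strip()
    # Otherwise grab everything from the first SELECT/WITH onward.
    bare = re.search(r"(?is)\b(select|with)\b.*", response)
    return bare.group(0).strip() if bare else ""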

SQLCoder-34B steps:

If you plan to test and generate SQL for the proposed dataset using the SQLCoder-34B baseline, please follow the steps detailed in the Colab notebook at data/sqlcoder/defog_sqlcoder_colab.ipynb.

SQLize-7B steps:

If you plan to test SQLize with the pretrained model, download the models from Google Drive and copy them to this location in the repository:

mv ~/Downloads/models.zip /workspace/data/cs598-tpcds/data/sqlize/
cd /workspace/data/cs598-tpcds/data/sqlize
unzip models.zip

Once the models are downloaded and unzipped (~15 GB), please follow the steps discussed in the data/sqlize/fine-tuning/inference.ipynb notebook.
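In outline, the inference notebook loads the base model, attaches the fine-tuned PEFT adapter, and generates SQL from a prompt. A minimal sketch, where the base model name, adapter path, and prompt are assumptions rather than the notebook's actual values:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "codellama/CodeLlama-7b-hf"  # assumed base model
adapter_dir = "data/sqlize/models"       # assumed unzip location from above

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)

prompt = "-- Question: total store sales revenue in 1998\nSELECT"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))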

SQLize Training and Fine-tuning

  1. Download the spider_wikisql_dataset.json dataset from Google Drive and copy the downloaded file to data/dataset-spider/spider_wikisql_dataset.json.
  2. Pretraining with DAPT and PEFT can be executed by following the data/sqlize/pre-training/pre_training.ipynb notebook.
  3. After pretraining (which takes about 9 hours to complete), move on to fine-tuning with SFT by following the data/sqlize/fine-tuning/fine-tuning.ipynb notebook; a minimal sketch of the SFT step follows this list. For this step, the proposed dataset is augmented with 10 different NLQs; the updated dataset is available in the data/sqlize/fine-tuning/dataset folder.
  4. To perform inference, follow the steps described in the data/sqlize/fine-tuning/inference.ipynb notebook.
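For orientation, a minimal LoRA-based SFT sketch using Hugging Face peft and trl (as of late-2023 trl APIs); the base model, hyperparameters, and text field name are illustrative assumptions, not the notebook's actual configuration:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

base_name = "codellama/CodeLlama-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name, device_map="auto")

# JSON dataset from step 1; the "text" field name is an assumption.
dataset = load_dataset("json", data_files="data/dataset-spider/spider_wikisql_dataset.json")["train"]

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()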

Dataset keyword and complexity analysis

The analysis scripts are:

  • Complexity analysis: python scripts/datastats/complexity_analysis.py
  • Keyword analysis: python scripts/datastats/keyword_analysis.py
  • Presence keyword analysis: python scripts/datastats/presense_keyword_analysis.py

By default, these scripts run on the proposed dataset. To run them on the Spider dataset instead, uncomment extract_target_dir = '/workspace/data/cs598-tpcds/data/dataset-spider/' in each of the scripts above.
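As a toy illustration of keyword-based complexity scoring (not the repository's actual complexity rules), one might count SQL keywords and nested SELECTs:

import re

KEYWORDS = ["join", "group by", "order by", "having", "union", "case when"]

def complexity(sql: str) -> dict:
    # Toy score: keyword counts plus a count of nested SELECT subqueries.
    s = sql.lower()
    stats = {kw: len(re.findall(rf"\b{kw}\b", s)) for kw in KEYWORDS}
    stats["subqueries"] = len(re.findall(r"\(\s*select\b", s))
    return stats

print(complexity("SELECT a FROM t WHERE a IN (SELECT b FROM u) GROUP BY a"))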

Contribute

We have many improvements we'd like to implement. Please help us! If you would like to contribute, please contact us.

Contact us

Reach out to btaleka2 at illinois dot edu