REDSQL

This README explains how to set up the environment, download the dataset, and run REDSQL.

Dataset Structure

The dataset is organized into the following directory structure:

.
├── bird
│   ├── database             # Database directory
│   ├── dev_annotation.json  # Generated annotations
│   ├── dev.json            # Development set input
│   ├── dev.sql             # Ground truth SQL queries
│   └── dev_tables.json     # Database schema
├── spider
│   ├── database
│   ├── dev_annotation.json
│   ├── dev_gold.sql
│   ├── dev.json
│   └── dev_tables.json
├── preds
│   └── Predicted_SQLs      # SQL predictions from baseline methods (e.g., PURPLE, CodeS)
...

Dataset Components

  • bird: Contains the BIRD dataset files including database, annotations, development set, ground truth SQL queries, and schema information.
  • spider: Contains the Spider dataset files, with a structure similar to BIRD.
  • preds: Contains SQL predictions from various baseline methods.
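
Before running any step, it can help to confirm that a local copy matches the layout above. The following is a minimal sketch and not part of REDSQL: the function name `missing_dataset_files` and its `root` argument are illustrative assumptions, and the expected entries are copied from the directory tree in this README.

```python
from pathlib import Path

# Expected entries per dataset, copied from the directory tree in this README.
EXPECTED = {
    "bird": ["database", "dev_annotation.json", "dev.json", "dev.sql", "dev_tables.json"],
    "spider": ["database", "dev_annotation.json", "dev_gold.sql", "dev.json", "dev_tables.json"],
}


def missing_dataset_files(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    missing = []
    for dataset, entries in EXPECTED.items():
        for entry in entries:
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing
```

An empty list means every expected file and directory was found under `root`; anything else names the entries still to be downloaded or moved into place.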

Environment Setup

1. System Requirements

sudo apt-get update
sudo apt-get install -y openjdk-11-jdk

2. Create Conda Environment

conda create -n red python=3.9
conda activate red

3. Install Dependencies

# Install PyTorch
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

# Install NMSLib
conda install -c conda-forge nmslib

# Install remaining requirements
pip install -r requirements.txt

Usage Instructions

1. Directory Setup

mkdir output logs

2. Build Value Search Index

python -m pre_processing.build_contents_index \
    --output_dir=./index/bird/db_contents_index/ \
    --db_dir=./datasets/bird/database/
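
The indexing script itself is not reproduced here. As a rough illustration of the kind of data a database content index is typically built from, the sketch below collects distinct text values per column from a single SQLite file. It assumes the databases are SQLite files, as in the public BIRD and Spider releases; `collect_text_values` and its `limit` parameter are hypothetical names, and this is not REDSQL's actual indexing logic.

```python
import sqlite3


def collect_text_values(db_path, limit=1000):
    """Map 'table.column' to up to `limit` distinct non-null text values."""
    values = {}
    conn = sqlite3.connect(db_path)
    try:
        names = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        # Skip SQLite's internal bookkeeping tables.
        tables = [n[0] for n in names if not n[0].startswith("sqlite_")]
        for table in tables:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            for row in conn.execute(f'PRAGMA table_info("{table}")').fetchall():
                column, col_type = row[1], (row[2] or "").upper()
                if "TEXT" not in col_type and "CHAR" not in col_type:
                    continue  # only textual columns are collected here
                rows = conn.execute(
                    f'SELECT DISTINCT "{column}" FROM "{table}" '
                    f'WHERE "{column}" IS NOT NULL LIMIT ?',
                    (limit,),
                ).fetchall()
                values[f"{table}.{column}"] = [r[0] for r in rows]
    finally:
        conn.close()
    return values
```

Running this on one database gives a quick sense of how many values a content index would have to cover, which can help when estimating indexing time for a custom dataset.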

3. Generate Annotations (Optional)

Note: This step can be skipped as we provide pre-generated annotations for the following datasets:

  • BIRD
  • Science
  • Spider

These annotations are available in our open-source repository.

If you need to generate annotations for a custom dataset, use the following command:

python -m pre_processing.doc \
    --model_name=gpt-4o-2024-08-06 \
    --output_file=./annotation.json \
    --table_file=./datasets/bird/dev_tables.json \
    --db_dir=./datasets/bird/database/
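
When preparing a custom dataset, it may be useful to inspect the schema file passed as `--table_file` before generating annotations. The sketch below assumes the file follows the Spider-style `tables.json` layout used by the public Spider and BIRD releases (a list of objects with `db_id`, `table_names_original`, and `column_names_original`); `summarize_schema` is a hypothetical helper, not part of REDSQL.

```python
import json


def summarize_schema(table_file):
    """Return {db_id: {table_name: [column_name, ...]}} from a Spider-style schema file."""
    with open(table_file, encoding="utf-8") as f:
        databases = json.load(f)
    summary = {}
    for db in databases:
        names = db["table_names_original"]
        tables = {name: [] for name in names}
        # Each column entry is a [table_index, column_name] pair.
        for table_idx, column in db["column_names_original"]:
            if table_idx < 0:
                continue  # the leading [-1, "*"] entry belongs to no table
            tables[names[table_idx]].append(column)
        summary[db["db_id"]] = tables
    return summary
```

If a custom schema file fails to load with this helper, it likely needs to be converted to the same layout before the annotation command can use it.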

4. Run REDSQL

python -m main.run \
    --model_name=model_name \
    --batch_size=2 \
    --exp_name=exp_name \
    --bug_fix \
    --consistency_num=30 \
    --stage=dev \
    --preds=/path/to/predicted/sql.txt \
    --db_content_index_path=/path/to/db/content/index \
    --annotation=/path/to/dev_annotation.json \
    --output_dir=./output \
    --dev_file=/path/to/dev.json \
    --table_file=/path/to/dev_tables.json \
    --db_dir=/path/to/database
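
For repeated experiments, the command above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_run_command` is a hypothetical helper, the flags and defaults are taken only from this README, and `--bug_fix` is treated as a value-less switch because that is how it appears in the command above.

```python
import sys


def build_run_command(model_name, exp_name, preds, db_content_index_path,
                      annotation, dev_file, table_file, db_dir,
                      output_dir="./output", batch_size=2,
                      consistency_num=30, stage="dev", bug_fix=True):
    """Assemble the `python -m main.run` argument list shown in this README."""
    cmd = [
        sys.executable, "-m", "main.run",
        f"--model_name={model_name}",
        f"--batch_size={batch_size}",
        f"--exp_name={exp_name}",
    ]
    if bug_fix:
        cmd.append("--bug_fix")  # switch without a value, as in the command above
    cmd += [
        f"--consistency_num={consistency_num}",
        f"--stage={stage}",
        f"--preds={preds}",
        f"--db_content_index_path={db_content_index_path}",
        f"--annotation={annotation}",
        f"--output_dir={output_dir}",
        f"--dev_file={dev_file}",
        f"--table_file={table_file}",
        f"--db_dir={db_dir}",
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a single string avoids shell quoting problems when paths contain spaces.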

Command Line Arguments

Argument                  Description
--model_name              Name of the Language Model to use
--batch_size              Batch size for processing (default: 2)
--exp_name                Name of the experiment
--bug_fix                 Enable bug fixing functionality
--bug_only                Only fix SQL when errors are detected
--consistency_num         Number of consistency checks (default: 30)
--stage                   Processing stage (e.g., 'dev')
--preds                   Path to predicted SQL statements
--db_content_index_path   Path to database content index
--annotation              Path to annotation file
--output_dir              Directory for output files
--dev_file                Path to development set file
--table_file              Path to table schema file
--db_dir                  Path to database directory

Note: Ensure all required files are in place and paths are correctly configured before running the commands.