REDSQL

This README explains how to set up the environment, obtain the dataset, and run REDSQL.

Dataset Structure

The dataset is organized into the following directory structure:

.
├── bird
│   ├── database             # Database directory
│   ├── dev_annotation.json  # Generated annotations
│   ├── dev.json             # Development set input
│   ├── dev.sql              # Ground truth SQL queries
│   └── dev_tables.json      # Database schema
├── spider
│   ├── database
│   ├── dev_annotation.json
│   ├── dev_gold.sql
│   ├── dev.json
│   └── dev_tables.json
├── preds
│   └── Predicted_SQLs      # SQL predictions from baseline methods (e.g., PURPLE, CodeS)
...

Dataset Components

bird: Contains the BIRD dataset files including database, annotations, development set, ground truth SQL queries, and schema information.
spider: Contains the Spider dataset files, organized in the same way as the BIRD files.
preds: Contains SQL predictions from various baseline methods.
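
Before running anything, it can help to confirm that the layout above is actually in place. The following is a minimal sketch, not part of REDSQL: the helper name `missing_entries` is hypothetical, and the entry names are copied from the directory tree above.

```python
from pathlib import Path

# Expected entries per dataset, copied from the directory tree in this README.
EXPECTED = {
    "bird": ["database", "dev_annotation.json", "dev.json", "dev.sql", "dev_tables.json"],
    "spider": ["database", "dev_annotation.json", "dev_gold.sql", "dev.json", "dev_tables.json"],
}


def missing_entries(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    missing = []
    for dataset, names in EXPECTED.items():
        for name in names:
            if not (root / dataset / name).exists():
                missing.append(f"{dataset}/{name}")
    return missing
```

Calling `missing_entries(".")` from the dataset root returns an empty list when every file and directory shown in the tree is present.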

Environment Setup

1. System Requirements

sudo apt-get update
sudo apt-get install -y openjdk-11-jdk

2. Create Conda Environment

conda create -n red python=3.9
conda activate red

3. Install Dependencies

# Install PyTorch
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

# Install NMSLib
conda install -c conda-forge nmslib

# Install remaining requirements
pip install -r requirements.txt
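
After installation, a quick check can confirm that the packages above are importable and that a Java runtime is on the PATH. This is a sketch under assumptions: `check_environment` is a hypothetical helper, and the module list covers only the packages named in this section, not the contents of requirements.txt.

```python
import importlib.util
import shutil

# Import names of the packages installed above ("torch" is PyTorch's module name).
REQUIRED_MODULES = ("torch", "torchvision", "torchaudio", "nmslib")


def check_environment(modules=REQUIRED_MODULES):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module not found: {name}")
    # The JDK installed in step 1 should provide a "java" executable.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    return problems
```

Running `print(check_environment())` inside the `red` environment lists anything that still needs to be installed.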

Usage Instructions

1. Directory Setup

mkdir -p output logs

2. Build Value Search Index

python -m pre_processing.build_contents_index \
    --output_dir=./index/bird/db_contents_index/ \
    --db_dir=./datasets/bird/database/
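
To give a rough idea of the kind of data a value search index is built over, the sketch below collects the distinct text values of every column in a SQLite database. It is an illustration only and is not the implementation of `pre_processing.build_contents_index`; the function names and the per-column limit are assumptions.

```python
import sqlite3


def quote_ident(name):
    """Quote a SQLite identifier so unusual table or column names are safe to use."""
    return '"' + name.replace('"', '""') + '"'


def collect_text_values(conn, limit_per_column=1000):
    """Map (table, column) to the distinct text values stored in that column."""
    values = {}
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = [row[0] for row in rows if not row[0].startswith("sqlite_")]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f"PRAGMA table_info({quote_ident(table)})")
        for column in [row[1] for row in info]:
            col, tab = quote_ident(column), quote_ident(table)
            query = (
                f"SELECT DISTINCT {col} FROM {tab} "
                f"WHERE typeof({col}) = 'text' LIMIT ?"
            )
            values[(table, column)] = [
                row[0] for row in conn.execute(query, (limit_per_column,))
            ]
    return values
```

To try it on a real database, pass `sqlite3.connect(path)` for one of the `.sqlite` files under the database directory.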

3. Generate Annotations (Optional)

Note: This step can be skipped. Pre-generated annotations for the BIRD, Science, and Spider datasets are available in our open-source repository.

If you need to generate annotations for a custom dataset, use the following command:

python -m pre_processing.doc \
    --model_name=gpt-4o-2024-08-06 \
    --output_file=./annotation.json \
    --table_file=./datasets/bird/dev_tables.json \
    --db_dir=./datasets/bird/database/
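
When preparing a custom dataset, it is useful to inspect the schema file passed as `--table_file`. The sketch below assumes the file follows the Spider-style `tables.json` layout (a list of entries with `db_id`, `table_names_original`, and `column_names_original`); whether a custom dataset must match this layout exactly is an assumption, and the helper names are hypothetical.

```python
import json
from collections import defaultdict


def tables_to_columns(schema):
    """Map each original table name to its column names for one schema entry."""
    tables = schema["table_names_original"]
    columns = defaultdict(list)
    for table_index, column_name in schema["column_names_original"]:
        if table_index < 0:
            continue  # index -1 marks the "*" pseudo-column
        columns[tables[table_index]].append(column_name)
    return dict(columns)


def load_schemas(table_file):
    """Read a schema file and return {db_id: {table: [columns]}}."""
    with open(table_file, encoding="utf-8") as handle:
        return {entry["db_id"]: tables_to_columns(entry) for entry in json.load(handle)}
```

For example, `load_schemas("./datasets/bird/dev_tables.json")` gives a per-database summary that can be compared against the generated annotation file.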

4. Run REDSQL

python -m main.run \
    --model_name=model_name \
    --batch_size=2 \
    --exp_name=exp_name \
    --bug_fix \
    --consistency_num=30 \
    --stage=dev \
    --preds=/path/to/predicted/sql.txt \
    --db_content_index_path=/path/to/db/content/index \
    --annotation=/path/to/dev_annotation.json \
    --output_dir=./output \
    --dev_file=/path/to/dev.json \
    --table_file=/path/to/dev_tables.json \
    --db_dir=/path/to/database
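
For repeated experiments, the same command can be assembled from Python instead of being typed by hand. This is a minimal sketch: `build_command` is a hypothetical helper, and it only reproduces the flags shown in the command above without validating them.

```python
import subprocess
import sys


def build_command(options, flags=()):
    """Turn option values and boolean flags into an argument list for main.run."""
    cmd = [sys.executable, "-m", "main.run"]
    for name, value in options.items():
        cmd.append(f"--{name}={value}")
    for flag in flags:
        cmd.append(f"--{flag}")
    return cmd


def run(options, flags=()):
    """Launch REDSQL with the given options; raises CalledProcessError on failure."""
    return subprocess.run(build_command(options, flags), check=True)
```

For instance, `build_command({"batch_size": 2, "stage": "dev"}, flags=["bug_fix"])` yields the interpreter path followed by `-m main.run --batch_size=2 --stage=dev --bug_fix`; the remaining path options from the command above are added to the dictionary in the same way.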

Command Line Arguments

Argument	Description
`--model_name`	Name of the Language Model to use
`--batch_size`	Batch size for processing (default: 2)
`--exp_name`	Name of the experiment
`--bug_fix`	Enable bug fixing functionality
`--bug_only`	Only fix SQL when errors are detected
`--consistency_num`	Number of consistency checks (default: 30)
`--stage`	Processing stage (e.g., 'dev')
`--preds`	Path to predicted SQL statements
`--db_content_index_path`	Path to database content index
`--annotation`	Path to annotation file
`--output_dir`	Directory for output files
`--dev_file`	Path to development set file
`--table_file`	Path to table schema file
`--db_dir`	Path to database directory

Note: Ensure all required files are in place and paths are correctly configured before running the commands.

httdty/REDSQL_VLDB