This README provides comprehensive instructions for setting up the environment, downloading the dataset, and running REDSQL.
The dataset is organized into the following directory structure:
```
.
├── bird
│   ├── database              # Database directory
│   ├── dev_annotation.json   # Generated annotations
│   ├── dev.json              # Development set input
│   ├── dev.sql               # Ground truth SQL queries
│   └── dev_tables.json       # Database schema
├── spider
│   ├── database
│   ├── dev_annotation.json
│   ├── dev_gold.sql
│   ├── dev.json
│   └── dev_tables.json
├── preds
│   └── Predicted_SQLs        # SQL predictions from baseline methods (e.g., PURPLE, Codes)
...
```
- bird: Contains the BIRD dataset files including database, annotations, development set, ground truth SQL queries, and schema information.
- spider: Contains the Spider dataset files, which follow a structure similar to BIRD's.
- preds: Contains SQL predictions from various baseline methods.
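For a quick sanity check of the layout, the snippet below simply tests that the expected BIRD files exist. It assumes the tree above is your dataset root; adjust the prefix if you keep the data under `./datasets/` as in the commands later in this README:

```bash
# Verify the expected BIRD files are in place (adjust the prefix to your dataset root).
for f in dev.json dev_tables.json dev_annotation.json dev.sql; do
  [ -e "bird/$f" ] && echo "ok: bird/$f" || echo "missing: bird/$f"
done
[ -d "bird/database" ] && echo "ok: bird/database" || echo "missing: bird/database"
```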
First, install OpenJDK 11:

```bash
sudo apt-get update
sudo apt-get install -y openjdk-11-jdk
```

Then create the conda environment and install the Python dependencies:

```bash
conda create -n red python=3.9
conda activate red

# Install PyTorch
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

# Install NMSLib
conda install -c conda-forge nmslib

# Install remaining requirements
pip install -r requirements.txt
```
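Optionally, confirm that the toolchain installed correctly. This only checks the Java and Python packages installed above and is not REDSQL-specific:

```bash
java -version                                                    # should report version 11
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import nmslib; print('nmslib imported successfully')"
```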
Create directories for outputs and logs:

```bash
mkdir output logs
```
Build the database content index:

```bash
python -m pre_processing.build_contents_index \
    --output_dir=./index/bird/db_contents_index/ \
    --db_dir=./datasets/bird/dev_database/
```
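The same module should work for the other datasets by swapping the paths. For example, an analogous Spider index build might look like the following; the Spider output and database paths here are assumptions based on the directory layout above, not paths taken from the repository:

```bash
python -m pre_processing.build_contents_index \
    --output_dir=./index/spider/db_contents_index/ \
    --db_dir=./datasets/spider/database/
```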
Note: The annotation step can be skipped, as we provide pre-generated annotations for the following datasets:
- BIRD
- Science
- Spider
These annotations are available in our open-source repository.
If you need to generate annotations for a custom dataset, use the following command:
```bash
python -m pre_processing.doc \
    --model_name=gpt-4o-2024-08-06 \
    --output_file=./annotation.json \
    --table_file=./datasets/bird/dev_tables.json \
    --db_dir=./datasets/bird/database/
```
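Because this command queries a hosted model (`gpt-4o-2024-08-06`), the corresponding API credentials must be available when it runs. Assuming the standard OpenAI environment variable is used (check the pre-processing code for the exact variable name it reads), that means something like:

```bash
export OPENAI_API_KEY="<your-api-key>"   # assumed variable name; verify against the annotation code
```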
Run REDSQL with the following command:

```bash
python -m main.run \
    --model_name=model_name \
    --batch_size=2 \
    --exp_name=exp_name \
    --bug_fix \
    --consistency_num=30 \
    --stage=dev \
    --preds=/path/to/predicted/sql.txt \
    --db_content_index_path=/path/to/db/content/index \
    --annotation=/path/to/dev_annotation.json \
    --output_dir=./output \
    --dev_file=/path/to/dev.json \
    --table_file=/path/to/dev_tables.json \
    --db_dir=/path/to/database
```
| Argument | Description |
|---|---|
| `--model_name` | Name of the language model to use |
| `--batch_size` | Batch size for processing (default: 2) |
| `--exp_name` | Name of the experiment |
| `--bug_fix` | Enable bug fixing functionality |
| `--bug_only` | Only fix SQL when errors are detected |
| `--consistency_num` | Number of consistency checks (default: 30) |
| `--stage` | Processing stage (e.g., `dev`) |
| `--preds` | Path to predicted SQL statements |
| `--db_content_index_path` | Path to database content index |
| `--annotation` | Path to annotation file |
| `--output_dir` | Directory for output files |
| `--dev_file` | Path to development set file |
| `--table_file` | Path to table schema file |
| `--db_dir` | Path to database directory |
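As a concrete illustration, a BIRD run over the layout above might look like the following. The model name, experiment name, and prediction file name are placeholders (not values taken from the repository), and the dataset paths assume the tree shown earlier sits under `./datasets/`:

```bash
python -m main.run \
    --model_name=<llm_name> \
    --batch_size=2 \
    --exp_name=bird_dev \
    --bug_fix \
    --consistency_num=30 \
    --stage=dev \
    --preds=./datasets/preds/Predicted_SQLs/<baseline_predictions>.txt \
    --db_content_index_path=./index/bird/db_contents_index/ \
    --annotation=./datasets/bird/dev_annotation.json \
    --output_dir=./output \
    --dev_file=./datasets/bird/dev.json \
    --table_file=./datasets/bird/dev_tables.json \
    --db_dir=./datasets/bird/database/
```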
Note: Ensure all required files are in place and paths are correctly configured before running the commands.