Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

Description

This repo contains a diagnostic evaluation benchmark for the robustness of text-to-SQL models. It comprises 17 perturbation test sets that measure model robustness from different angles. It is released along with our ICLR 2023 paper, Dr. Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness, which describes the benchmark in detail.

The dataset is built from the development set of the Spider dataset; our changes are intended to supplement the work done by Spider.

Preprocessing

First, unpack the data using the following commands.

mkdir data
tar -xvf data.tar.gz -C data

Run data_preprocess.py to copy the pre-perturbation databases and tables from the original Spider development set.

python data_preprocess.py

To Use

Each folder contains a perturbation test set. There are 3 DB perturbation test sets (starting with DB_), 9 NLQ perturbation test sets (starting with NLQ_), and 5 SQL perturbation test sets (starting with SQL_). Each test set contains parallel pre-perturbation and post-perturbation test data.

DB_*: data with DB perturbation. Each set contains two database folders, two table files, and two question files, corresponding to the pre-perturbation and post-perturbation data.
NLQ_*: data with NLQ perturbation. Each set contains a single database folder, a single table file, and two question files (one for pre-perturbation and the other for post-perturbation).
SQL_*: data with SQL perturbation. Each set contains a single database folder, a single table file, and two question files (one for pre-perturbation and the other for post-perturbation).
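
As a quick sanity check after unpacking, the test-set folders can be grouped by their prefix. The snippet below is a minimal sketch and not part of the released scripts: the function names group_test_sets and list_test_sets are hypothetical, and it assumes the test sets are sub-folders of data/ whose names start with DB_, NLQ_, or SQL_.

```python
from collections import defaultdict
from pathlib import Path

# Perturbation categories named in this README.
PREFIXES = ("DB", "NLQ", "SQL")


def group_test_sets(names):
    """Group folder names by perturbation category, based on their prefix."""
    groups = defaultdict(list)
    for name in sorted(names):
        prefix, sep, _ = name.partition("_")
        # Keep only names of the form DB_*, NLQ_*, or SQL_*.
        if sep and prefix in PREFIXES:
            groups[prefix].append(name)
    return dict(groups)


def list_test_sets(data_dir="data"):
    """Group the sub-folders of data_dir (an assumed location) by category."""
    root = Path(data_dir)
    return group_test_sets(p.name for p in root.iterdir() if p.is_dir())
```

If the layout matches these assumptions, list_test_sets() should return 3 DB, 9 NLQ, and 5 SQL folder names.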

First, run the model on the Spider-dev set to get the predicted SQL queries and put them in predictions/Spider-dev/[model_name]/pred.sql. Then, run the model on each post-perturbation set and put the predicted SQL queries in predictions/[perturbation_name]/[model_name]/pred.sql.
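
The prediction layout described above can be produced with a small helper. This is a sketch under stated assumptions, not code from this repo: prediction_path and write_predictions are hypothetical names, and writing one SQL query per line is an assumption about what the evaluation expects.

```python
from pathlib import Path


def prediction_path(test_set, model_name, root="predictions"):
    """Return predictions/<test_set>/<model_name>/pred.sql as a Path."""
    return Path(root) / test_set / model_name / "pred.sql"


def write_predictions(test_set, model_name, queries, root="predictions"):
    """Write predicted SQL queries, one per line (assumed format)."""
    path = prediction_path(test_set, model_name, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Collapse internal whitespace so each query stays on a single line.
    lines = [" ".join(query.split()) for query in queries]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For example, write_predictions("Spider-dev", "my_model", queries) would create predictions/Spider-dev/my_model/pred.sql, where my_model is a placeholder model name.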

To Evaluate a Model

Run copy_pre_perturbation_predictions.py to copy the SQL predictions for Spider-dev to all pre-perturbation sets. Then evaluate the model on each pre-perturbation and post-perturbation set using the test-suite evaluation.

python copy_pre_perturbation_predictions.py --model [model_name]

Leaderboard

Pre-perturbation and post-perturbation accuracy in terms of execution (EX)

The tables below report the EX accuracy of models on pre-perturbation and post-perturbation data. We report the macro-average results over the perturbation test sets in the DB, NLQ, and SQL categories. Each cell x-y gives the accuracy on pre-perturbation data (x) and on post-perturbation data (y).

Evaluation of Finetuned Models

Model	Average of DB perturbation test sets	Average of NLQ perturbation test sets	Average of SQL perturbation test sets	Average of all test sets
Picard	78.9-55.0	76.0-65.0	76.3-74.0	76.6-65.9
SmBoP	74.7-50.0	76.6-58.1	74.7-72.2	75.7-60.8
T5-3B LK	73.5-47.0	70.4-58.9	71.7-69.6	71.3-59.9
T5-3B	69.5-42.9	68.2-54.9	70.9-69.5	69.2-57.1
T5-large	64.0-36.7	63.6-50.9	65.6-64.7	64.2-54.2
RatSQL	70.8-33.9	70.2-50.7	68.8-62.4	69.9-51.5
T5-base	51.1-22.8	50.0-32.6	56.9-51.8	54.3-40.6
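
To illustrate how the x-y cells and the macro average can be read, the sketch below parses the Picard row above and recombines the three category averages, weighting each by its number of test sets (3 DB, 9 NLQ, 5 SQL). Treating the last column as the macro average over all 17 test sets is an assumption; it reproduces the Picard row, but because the category averages are already rounded, other rows may not match exactly. The names parse_cell and overall_average are hypothetical.

```python
# Number of perturbation test sets per category, as stated in this README.
NUM_SETS = {"DB": 3, "NLQ": 9, "SQL": 5}


def parse_cell(cell):
    """Split an 'x-y' cell into (pre-perturbation, post-perturbation) accuracy."""
    pre, post = cell.split("-")
    return float(pre), float(post)


def overall_average(row):
    """Macro average over all test sets, rebuilt from the category averages."""
    total = sum(NUM_SETS.values())
    pre = sum(NUM_SETS[cat] * parse_cell(cell)[0] for cat, cell in row.items())
    post = sum(NUM_SETS[cat] * parse_cell(cell)[1] for cat, cell in row.items())
    return round(pre / total, 1), round(post / total, 1)


picard = {"DB": "78.9-55.0", "NLQ": "76.0-65.0", "SQL": "76.3-74.0"}
print(overall_average(picard))  # (76.6, 65.9)
```

The gap between the two numbers in a cell is the absolute accuracy drop under perturbation, for example 78.9 - 55.0 = 23.9 points for Picard on the DB test sets.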

Evaluation of In-context Learning Methods

Model	Average of DB perturbation test sets	Average of NLQ perturbation test sets	Average of SQL perturbation test sets	Average of all test sets
Codex	72.6-60.7	75.3-60.8	74.6-73.1	74.6-64.4

Citation and Contact

If you use the dataset in your work, please cite our paper and the Spider paper.

@article{chang2023dr,
  title={Dr. Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness},
  author={Chang, Shuaichen and Wang, Jun and Dong, Mingwen and Pan, Lin and Zhu, Henghui and Li, Alexander Hanbo and Lan, Wuwei and Zhang, Sheng and Jiang, Jiarong and Lilien, Joseph and others},
  journal={arXiv preprint arXiv:2301.08881},
  year={2023}
}

@inproceedings{yu2018spider,
  title={Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={3911--3921},
  year={2018}
}

Please contact Shuaichen Chang (chang.1692[at]osu.edu) for questions and suggestions.

Acknowledgement

We thank the authors of Spider for allowing us to redistribute the data in the Spider development set.

awslabs/diagnostic-robustness-text-to-sql