This repository contains the code for the Findings of NAACL 2022 paper "Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment".
If you use Spider-SS or Spider-CG in your work, please cite it as follows:
@misc{gan-etal-2022-measuring-and,
doi = {10.48550/ARXIV.2205.02054},
url = {https://arxiv.org/abs/2205.02054},
author = {Gan, Yujian and Chen, Xinyun and Huang, Qiuping and Purver, Matthew},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
title = {Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
This repository is built upon NatSQL. Some algorithms described in the Spider-SS paper, such as sentence splitting, live in the NatSQL repository. Download both NatSQL and this repository, then combine them by copying the files in this repository into the root directory of NatSQL.
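The copy step can also be scripted. The following is a minimal sketch, not part of either repository: the function name merge_into_natsql and both paths are hypothetical, and it assumes that a file in this repository should overwrite a NatSQL file with the same relative path.

```python
import shutil
from pathlib import Path


def merge_into_natsql(this_repo: Path, natsql_root: Path) -> None:
    """Copy the contents of this repository into the NatSQL root directory.

    Assumption: files that exist in both trees are overwritten by the
    version from this repository; NatSQL files with other names are kept.
    """
    shutil.copytree(
        this_repo,
        natsql_root,
        dirs_exist_ok=True,  # Python 3.8+: allow copying into an existing directory
        ignore=shutil.ignore_patterns(".git"),  # do not clobber NatSQL's git metadata
    )
```

For example, `merge_into_natsql(Path("Spider-SS"), Path("NatSQL"))` would merge two checkouts sitting side by side; both directory names are placeholders for wherever the repositories were cloned.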
After combining the two repositories, install the Python dependencies via pip install -r requirements.txt.
Download the Spider dataset. Make sure to download the 06/07/2020 version or newer.
Unpack the dataset somewhere outside this project and put train_spider.json, dev.json, tables.json, and the database folder under the ./data/ directory.
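Before running the preprocessing scripts, it may help to confirm that the layout described above is in place. A minimal sketch, assuming only the file and folder names listed in this README; the helper name missing_spider_inputs is hypothetical:

```python
from pathlib import Path
from typing import List

# Names taken from the setup instructions in this README.
EXPECTED_FILES = ("train_spider.json", "dev.json", "tables.json")
EXPECTED_DIRS = ("database",)


def missing_spider_inputs(data_dir: Path) -> List[str]:
    """Return the expected Spider files and folders that are absent from data_dir."""
    missing = [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (data_dir / name).is_dir()]
    return missing
```

Calling `missing_spider_inputs(Path("./data"))` from the combined repository root returns an empty list when everything is in place.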
Run check_and_preprocess.sh to check and preprocess the dataset. It generates (1) train_spider.json and dev.json with NatSQLG and (2) the preprocessed tables.json and tables_for_natsql.json, all under the ./NatSQLv1_6/ directory.
Run sh preprocess_spider.sh to preprocess the Spider dataset. You should get the preprocessed files train_spider-preprocessed.json and dev-preprocessed.json. Alternatively, you can download our preprocessed Spider dataset here.
Run sh generate_spiderSS.sh to generate the Spider-SS dataset. You should get the Spider-SS files train_spider-SS-preprocessed.json, train_spider-SS-for-training.json, dev-SS-preprocessed.json, and dev-SS-for-training.json. The two *-for-training files can be used by modified models. Alternatively, you can download our generated Spider-SS dataset here.
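A quick way to sanity-check a generated file is to load it and look at its top-level shape. The sketch below assumes only that the file is valid JSON; it does not assume any particular fields, since this README does not describe the schema. The helper name summarize_json is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path: Path) -> str:
    """Return a one-line description of the top-level JSON value stored in path."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, list):
        return f"{path.name}: list with {len(data)} entries"
    if isinstance(data, dict):
        return f"{path.name}: object with {len(data)} keys"
    return f"{path.name}: {type(data).__name__}"
```

Pass it the path of any generated file, for example dev-SS-preprocessed.json in whichever directory the script wrote it to.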
Run sh generate_spiderCG.sh to generate the Spider-CG dataset. You should get the Spider-CG files train_spider-CG_SUB.json, train_spider-CG_APP.json, dev-CG_SUB.json, and dev-CG_APP.json. Alternatively, you can download our generated Spider-CG dataset here.
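The four scripts above can be chained from Python so that a failure stops the run early. This is a rough sketch under two assumptions: every script is invoked through sh (the README shows this for three of them but not for check_and_preprocess.sh), and all of them are run from the combined repository root. The function name run_pipeline and its runner parameter are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Script names in the order they appear in this README.
PIPELINE_SCRIPTS = (
    "check_and_preprocess.sh",
    "preprocess_spider.sh",
    "generate_spiderSS.sh",
    "generate_spiderCG.sh",
)


def run_pipeline(
    repo_root: Path,
    scripts: Sequence[str] = PIPELINE_SCRIPTS,
    runner: Callable = subprocess.run,
) -> None:
    """Run each script in order from repo_root, stopping at the first failure."""
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit code.
        runner(["sh", script], cwd=repo_root, check=True)
```

The runner parameter exists only so the function can be exercised without executing the real scripts; in normal use, `run_pipeline(Path("."))` is enough.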
The code and data are under the CC BY-SA 4.0 license.