CSpider: A Large-Scale Chinese Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
CSpider is a large Chinese dataset for complex and cross-domain semantic parsing and the text-to-SQL task (natural language interfaces for relational databases). It is released with our EMNLP 2019 paper: A Pilot Study for Chinese SQL Semantic Parsing. This repo contains all code for evaluation, preprocessing, and all baselines used in our paper. Please refer to the task site for a more general introduction and the leaderboard.
10/2019
We have started a Chinese text-to-SQL task with the full dataset translated from Spider. The submission tutorial and our dataset can be found at our task site. Please follow the tutorial to get your results on the unreleased test data. Thanks to Tao Yu for sharing the test set with us.
9/2019
The dataset used in our EMNLP 2019 paper was redivided from the training and development sets of Spider. The dataset can be downloaded from here. It is released only to reproduce the results in our paper. To join the CSpider leaderboard and to compare more directly with the original English results, please refer to our task site for the full dataset.
When you use the CSpider dataset, we would appreciate it if you cite the following:
@inproceedings{min2019pilot,
title={A Pilot Study for Chinese SQL Semantic Parsing},
author={Min, Qingkai and Shi, Yuefeng and Zhang, Yue},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages={3643--3649},
year={2019}
}
Our dataset is based on Spider; please cite it too.
- The code uses Python 2.7 and PyTorch 0.2.0 (GPU); we will update Python and PyTorch soon.
- Install PyTorch via conda:
conda install pytorch=0.2.0 -c pytorch
- Install the Python dependencies:
pip install -r requirements.txt
- Download the data, embedding and database:
- To use the full dataset (recommended), download the train/dev data from Google Drive or BaiduNetDisk (code: cgh1) and evaluate on the unreleased test data following the submission tutorial on our task site. Specifically,
  - Put the downloaded train.json and dev.json under the chisp/data/char/ directory. To use word-based methods, please do the word segmentation first and put the json files under the chisp/data/word/ directory.
  - Put the downloaded char_emb.txt under the chisp/embedding/ directory. This is generated from the Tencent multilingual embeddings for the cross-lingual word embeddings schema. To use the monolingual embedding schema, step 2 (downloading the pretrained GloVe embeddings) is necessary.
  - Put the downloaded database directory under the chisp/ directory.
  - Put the downloaded train_gold.sql and dev_gold.sql under the chisp/data/ directory.
- To use the dataset redivided from the original train and dev data as in our paper, download the train/dev/test data from here. This dataset is released only to reproduce the results in our paper, and results based on it cannot join the leaderboard. Specifically,
  - Put the downloaded data, database and embedding directories under the chisp/ directory. You can then run all the experiments shown in our paper (step 2, downloading the pretrained GloVe embeddings, is necessary). The models directory contains all the pretrained models.
- (optional) Download the pretrained GloVe embeddings and save them as chisp/embedding/glove.%dB.%dd.txt
- Generate training files for each module:
python preprocess_data.py -s char|word
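Before running the preprocessing step, it can help to confirm that the downloaded files for the full dataset are where the steps above put them. The following is a minimal sketch and not part of the repository: the helper name find_missing is hypothetical, the paths are copied from the list above, and dev_gold.sql is assumed to be the intended name of the dev gold file.

```python
import os

# Paths relative to the chisp/ directory, taken from the download steps above.
# Assumption: dev_gold.sql is the intended name of the dev gold file.
EXPECTED_PATHS = [
    "data/char/train.json",
    "data/char/dev.json",
    "data/train_gold.sql",
    "data/dev_gold.sql",
    "embedding/char_emb.txt",
    "database",
]


def find_missing(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    # "chisp" assumes the script is run from the directory containing chisp/.
    missing = find_missing("chisp")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All expected files are in place.")
```

For the word-based setting or the redivided dataset, pass a different list as the expected argument.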
- data/ contains:
  - char/ for character-based raw train/dev/test data; the corresponding processed dataset and saved models can be found at char/generated_datasets.
  - word/ for word-based raw train/dev/test data; the corresponding processed dataset and saved models can be found at word/generated_datasets.
- train.py is the main file for training. Use train_all.sh to train all the modules (see below).
- test.py is the main file for testing. It uses supermodel.py to call the trained modules and generate SQL queries. In practice, use test_gen.sh to generate SQL queries.
- evaluation.py is for evaluation. It uses process_sql.py. In practice, use evaluation.sh to evaluate the generated SQL queries.
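The three scripts described above form a train, generate, evaluate sequence. As a rough sketch of that ordering (not something shipped with the repository), the helper below runs them one after another and stops at the first failure. The name run_pipeline is hypothetical, and invoking the scripts with bash from the repository root is an assumption.

```python
import subprocess

# Script names come from the descriptions above. Running them with bash
# from the repository root is an assumption.
PIPELINE = ["train_all.sh", "test_gen.sh", "evaluation.sh"]


def run_pipeline(scripts=PIPELINE, run=subprocess.check_call):
    """Run each stage in order; check_call raises if a script exits non-zero."""
    for script in scripts:
        run(["bash", script])
```

Because check_call raises on a non-zero exit status, a failed training run prevents generation and evaluation from running on stale models.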
Run train_all.sh to train all the modules. It looks like:
python train.py \
--data_root path/to/char/or/word/based/generated_data \
--save_dir path/to/save/trained/module \
--train_component <module_name> \
--emb_path path/to/embeddings \
--col_emb_path path/to/corresponding/embeddings/for/column
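Since the same command is repeated once per module, the loop can also be driven from Python. The sketch below only assembles the flags shown above. The function names are hypothetical, the module names must be taken from train_all.sh (none are assumed here), and sharing one set of paths across all modules is an assumption.

```python
import subprocess


def build_train_command(module, data_root, save_dir, emb_path, col_emb_path):
    """Build the argument list for one train.py run, using the flags shown above."""
    return [
        "python", "train.py",
        "--data_root", data_root,
        "--save_dir", save_dir,
        "--train_component", module,
        "--emb_path", emb_path,
        "--col_emb_path", col_emb_path,
    ]


def train_modules(modules, run=subprocess.check_call, **paths):
    """Train each module in turn; check_call raises if a run exits non-zero."""
    for module in modules:
        run(build_train_command(module, **paths))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.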
Run test_gen.sh to generate SQL queries. test_gen.sh looks like:
python test.py \
--test_data_path path/to/char/or/word/based/raw/dev/or/test/data \
--models path/to/trained/module \
--output_path path/to/print/generated/SQL \
--emb_path path/to/embeddings \
--col_emb_path path/to/corresponding/embeddings/for/column
Run evaluation.sh to evaluate generated SQL queries. evaluation.sh looks like:
python evaluation.py \
--gold path/to/gold/dev/or/test/queries \
--pred path/to/predicted/dev/or/test/queries \
--etype evaluation/metric \
--db path/to/database \
--table path/to/tables
evaluation.py is taken from the general evaluation process on the Spider GitHub page.
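Before evaluating, it may be worth checking that the gold and predicted files contain the same number of queries. This sketch assumes one query per non-empty line in each file and that the evaluator pairs the two files line by line. Both are assumptions about the file format rather than statements from this README, and the function names are hypothetical.

```python
import io


def count_queries(path):
    """Count non-empty lines in a query file (one query per line is assumed)."""
    with io.open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_alignment(gold_path, pred_path):
    """Return the query count, or raise ValueError if the files disagree."""
    n_gold = count_queries(gold_path)
    n_pred = count_queries(pred_path)
    if n_gold != n_pred:
        raise ValueError(
            "gold has %d queries but pred has %d" % (n_gold, n_pred)
        )
    return n_gold
```

A mismatch usually means the generation step skipped an example or wrote an empty line, which would shift every later comparison.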
The implementation is based on SyntaxSQLNet. Please cite it too if you use this code.