
A public dataset for follow-up query analysis, accepted at AAAI 2019.


FollowUp Dataset

Natural Language Interfaces to Databases (NLIDB) have recently attracted considerable attention. NLIDB let users search databases in natural language instead of SQL-like query languages. While this saves users from having to learn query languages, multi-turn interaction with NLIDB usually involves multiple queries, where contextual information is vital for understanding the users' query intents. Our work addresses a typical contextual understanding problem, termed follow-up query analysis. We summarize typical follow-up query scenarios and provide the new FollowUp dataset with 1,000 query triples on 120 tables.

Citation

If you use FollowUp in your research work, please consider citing our work:

Qian Liu, Bei Chen, Jian-Guang Lou, Ge Jin and Dongmei Zhang. 2019. FANDA: A Novel Approach to Perform Follow-up Query Analysis. In AAAI.

@inproceedings{liu2019fanda,
  title={\textsc{FAnDa}: A Novel Approach to Perform Follow-up Query Analysis},
  author={Liu, Qian and Chen, Bei and Lou, Jian-Guang and Jin, Ge and Zhang, Dongmei},
  booktitle={AAAI},
  year={2019}
}

Evaluation

You can easily evaluate your model output on the FollowUp dataset with our data/eval.py script. Put your model predictions (as strings), one per test case, in the file data/predict.example, then run data/eval.py as follows:

python eval.py

You will get the evaluation result on the test set of FollowUp. For example, the provided example predictions yield:

================================================================================
                     FollowUp Dataset Evaluation Result
================================================================================
BLEU Score:  100.00 (%)
Symbol Acc:  100.00 (%)
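
For reference, here is a minimal sketch of producing the prediction file programmatically. It assumes eval.py expects one prediction string per line, in test-set order, matching the shipped data/predict.example; the predictions list is a placeholder for your model output.

# Minimal sketch: write predictions to data/predict.example, one per line.
# Assumption: one fused-query string per test case, in the same order
# as test.tsv (matching the shipped predict.example file).
predictions = [
    "show champions for different all-star game.",
    # ... one fused-query string per remaining test case
]

with open("data/predict.example", "w", encoding="utf-8") as f:
    for pred in predictions:
        f.write(pred.strip() + "\n")

# Then run the evaluation from the data folder:
#   python eval.py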

Processed Data

To alleviate the burden of preprocessing data, we provide our processed datasets in the folder data_processed; the preprocessing script will be released soon. The original dataset is placed under the data folder.

Tables

tables.jsonl stores the table information; every line (table) is a JSON object. header lists the column names, types lists the column types inherited from WikiSQL, id gives the originating table IDs in WikiSQL, and rows contains the values of the whole table. A line looks like the following:

{
	"header": [
		"Date",
		"Opponent",
		"Venue",
		"Result",
		"Attendance",
		"Competition"
	],
	"page_title": "2007–08 Guildford Flames season",
	"types": [
		"real",
		"text",
		"text",
		"text",
		"real",
		"text"
	],
	"page_id": 15213262,
	"id": [
		"2-15213262-12",
		"2-15213262-7"
	],
	"section_title": "March",
	"rows": [
		[
			"6",
			"Milton Keynes Lightning",
			"Away",
			"Lost 3-5 (Lightning win 11-6 on aggregate)",
			"537",
			"Knockout Cup Semi-Final 2nd Leg"
		],
		[
			"8",
			"Romford Raiders",
			"Home",
			"Won 7-3",
			"1,769",
			"League"
		],
		...
		[
			"28",
			"Chelmsford Chieftains",
			"Away",
			"Won 3-2",
			"474",
			"Premier Cup"
		]
	],
	"caption": "March"
}
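
Since the file is standard JSON Lines, loading it is straightforward. A minimal sketch, assuming the data/tables.jsonl path from the folder layout above:

import json

# Load tables.jsonl: one JSON object (table) per line.
with open("data/tables.jsonl", encoding="utf-8") as f:
    tables = [json.loads(line) for line in f]

table = tables[0]
print(table["header"])     # column names, e.g. ["Date", "Opponent", ...]
print(table["types"])      # WikiSQL-inherited types, e.g. ["real", "text", ...]
print(len(table["rows"]))  # number of value rows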

Content

train.tsv and test.tsv: the train/test split of the FollowUp dataset. Every line is a tuple of the form (Precedent Query, Follow-up Query, Fused Query, Table ID), where Table ID is the 1-based line index into tables.jsonl. Fields are separated by TAB (\t). A line looks like the following:

how many champions were there, according to this table?	show these champions for different all-star game.	show champions for different all-star game.	74
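
To illustrate the format, here is a minimal parsing sketch (assuming the data folder layout above) that reads one example and resolves its table via the 1-based Table ID:

import json

# Load the tables so Table IDs can be resolved (see the Tables section).
with open("data/tables.jsonl", encoding="utf-8") as f:
    tables = [json.loads(line) for line in f]

with open("data/train.tsv", encoding="utf-8") as f:
    for line in f:
        precedent, follow_up, fused, table_id = line.rstrip("\n").split("\t")
        table = tables[int(table_id) - 1]  # Table ID is 1-based
        print("Precedent:", precedent)
        print("Follow-up:", follow_up)
        print("Fused:    ", fused)
        print("Columns:  ", table["header"])
        break  # demo: only the first example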

Contact

If you have any questions or difficulty applying your model to the FollowUp dataset, please feel free to contact me: qian.liu AT buaa dot edu dot cn. You can also create a new issue, and I will address it as soon as possible.