Recent work on Natural Language Interfaces to Databases (NLIDB) has attracted considerable attention. NLIDB allows users to search databases using natural language instead of SQL-like query languages. While it saves users from having to learn query languages, multi-turn interaction with NLIDB usually involves multiple queries, where contextual information is vital to understanding the users' query intents. In this paper, we address a typical contextual understanding problem, termed follow-up query analysis. Our work summarizes typical follow-up query scenarios and provides the new FollowUp dataset with 1,000 query triples on 120 tables.
If you use FollowUp in your research work, please consider citing our work:
Qian Liu, Bei Chen, Jian-Guang Lou, Ge Jin and Dongmei Zhang. 2019. FANDA: A Novel Approach to Perform Follow-up Query Analysis. In AAAI.
```
@inproceedings{liu2019fanda,
  title={\textsc{FAnDa}: A Novel Approach to Perform Follow-up Query Analysis},
  author={Liu, Qian and Chen, Bei and Lou, Jian-Guang and Jin, Ge and Zhang, Dongmei},
  booktitle={AAAI},
  year={2019}
}
```
You can easily evaluate your model output on the FollowUp dataset with our `data/eval.py` script. Put your model predictions (as strings), one case per line, into the file `data/predict.example`, then run `data/eval.py` as follows:
```
python eval.py
```
This reports the evaluation result on the test set of FollowUp. For example, the provided example predictions yield:
```
================================================================================
FollowUp Dataset Evaluation Result
================================================================================
BLEU Score: 100.00 (%)
Symbol Acc: 100.00 (%)
```
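As a quick illustration, here is a minimal Python sketch of producing a prediction file in the expected format. `predict_fused` is a hypothetical placeholder for your own model, and the snippet assumes the tab-separated format of `test.tsv` described below.

```python
# A minimal sketch of producing data/predict.example: one predicted fused
# query per line, in the same order as test.tsv. `predict_fused` below is a
# hypothetical placeholder for your own model.

def predict_fused(precedent, follow_up):
    # Hypothetical baseline: echo the follow-up query unchanged.
    return follow_up

with open("data/test.tsv", encoding="utf-8") as fin, \
        open("data/predict.example", "w", encoding="utf-8") as fout:
    for line in fin:
        # Each line: Precedent Query, Follow-up Query, Fused Query, Table ID.
        precedent, follow_up, fused, table_id = line.rstrip("\n").split("\t")
        fout.write(predict_fused(precedent, follow_up) + "\n")
```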
To alleviate the burden of preprocessing data, we provide our processed datasets in the folder `data_processed`; the preprocessing script will be released soon. The original dataset is placed under the `data` folder.
`tables.jsonl`: stores the table information; every line (a table) is one JSON object. `header` holds the column names, `types` holds the column types inherited from WikiSQL, `id` records the originating table IDs in WikiSQL, and `rows` contains the values of the whole table. A line looks like the following:
```
{
"header": [
"Date",
"Opponent",
"Venue",
"Result",
"Attendance",
"Competition"
],
"page_title": "2007–08 Guildford Flames season",
"types": [
"real",
"text",
"text",
"text",
"real",
"text"
],
"page_id": 15213262,
"id": [
"2-15213262-12",
"2-15213262-7"
],
"section_title": "March",
"rows": [
[
"6",
"Milton Keynes Lightning",
"Away",
"Lost 3-5 (Lightning win 11-6 on aggregate)",
"537",
"Knockout Cup Semi-Final 2nd Leg"
],
[
"8",
"Romford Raiders",
"Home",
"Won 7-3",
"1,769",
"League"
],
...
[
"28",
"Chelmsford Chieftains",
"Away",
"Won 3-2",
"474",
"Premier Cup"
]
],
"caption": "March"
}
```
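Since every line is a standalone JSON object, a minimal Python sketch of loading `tables.jsonl` could look like the following (field names as described above):

```python
import json

# Each line of tables.jsonl is one JSON object describing a table,
# so we parse the file line by line rather than as a single document.
with open("data/tables.jsonl", encoding="utf-8") as f:
    tables = [json.loads(line) for line in f]

table = tables[0]
print(table["header"])     # column names, e.g. ["Date", "Opponent", ...]
print(table["types"])      # column types inherited from WikiSQL
print(len(table["rows"]))  # number of rows in the table
```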
`train.tsv` and `test.tsv`: the train/test split of the FollowUp dataset. Every line is a tuple of the form (Precedent Query, Follow-up Query, Fused Query, Table ID), where the Table ID is the 1-based line index into `tables.jsonl`. Fields are separated by TAB (`\t`). A line looks like the following:
```
how many champions were there, according to this table?	show these champions for different all-star game.	show champions for different all-star game.	74
```
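As a small illustration, here is a sketch of reading `train.tsv` and resolving each Table ID to its table; note the 1-based index into `tables.jsonl`:

```python
import json

# Load all tables once; Table IDs in the TSV files index into this list.
with open("data/tables.jsonl", encoding="utf-8") as f:
    tables = [json.loads(line) for line in f]

with open("data/train.tsv", encoding="utf-8") as f:
    for line in f:
        precedent, follow_up, fused, table_id = line.rstrip("\n").split("\t")
        # Table ID is the 1-based line index into tables.jsonl.
        table = tables[int(table_id) - 1]
        # e.g. use table["header"] and table["rows"] as context for a model.
```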
If you have any questions or difficulty applying your model to the FollowUp dataset, please feel free to contact me: qian.liu AT buaa dot edu dot cn. You can also create a new issue, and I will tackle it as soon as possible.