/text2sql

Datasets for Natural Language Interface to Databases

GNU General Public License v3.0GPL-3.0

nlidb-datasets

SENLIDB Dataset

This dataset contains 24,890 unique SQL queries, extracted from Stack Exchange Data Explorer site (http://data.stackexchange.com/stackoverflow/queries).

In addition to the information available on the original site, this dataset contains 780 additional manual annotations for 296 queries.

Contents:

  • README.md: this file

  • train.json:

    Contains 24594 entries, with the following structure:

    {
        title       : "(string) Title provided in natural language that explains the meaning of the SQL query"
        description : "(string) A longer description of the query (may be missing)"
        sql_plain   : "(string) The original query, as it appeared on data.stackexchange"
        sql         : "(string) Query with comments removed"
        comments    : "(list of strings) List with the removed comments"
        url         : "(string) URL where the query or its revision can be found"
        id          : "(string) Unique identifier, extracted from the url"
    }
  • test.json:

    Contains 296 entries that follow the structure from the training file, with the addition of the following fields:

    {
        annotations: [
            {
                annotation  : "(string) Natural Language manual annotation"
                annotator_id: "(integer) Number uniquely identifying the annotator"
            }
        ]
    }

Acknowledgements

This work has been funded by the Text2NeuralQL research project (PN-III-P2-2.1-PTE-2016-0109).

License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Reference

Brad, Florin, Radu Iacob, Ionel Hosu, and Traian Rebedea. "Dataset for a Neural Natural Language Interface for Databases (NNLIDB)." arXiv preprint arXiv:1707.03172 (2017). https://arxiv.org/abs/1707.03172