We release and maintain a gold-standard KBQA (Question Answering over Knowledge Base) dataset containing 5000 pairs of questions and SPARQL queries. LC-QuAD uses DBpedia version 04-2016 as the target KB.
License: You can download the dataset (released under a GPL 3.0 license), or read on to learn more.
Versioning: We use DBpedia version 04-2016 as our target KB. The public DBpedia endpoint (http://dbpedia.org/sparql) no longer serves this version, so many SPARQL queries may return no answers there. We strongly recommend hosting this version locally. To do so, see this guide.
Splits: We release the dataset split into training and test sets in an 80:20 ratio.
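As a minimal sketch of how a query could be pointed at a locally hosted endpoint instead of the public one: the endpoint URL below (http://localhost:8890/sparql) and the helper name build_sparql_request are assumptions made for illustration, not part of the dataset or its tooling. The example only builds the HTTP request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local SPARQL endpoint serving DBpedia 04-2016 is reachable
# at this URL. Adjust it to match your own installation.
LOCAL_ENDPOINT = "http://localhost:8890/sparql"


def build_sparql_request(query, endpoint=LOCAL_ENDPOINT):
    """Build (but do not send) an HTTP GET request for a SPARQL query.

    The query is passed in the `query` URL parameter and JSON results are
    requested through the Accept header.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# A generic query used only to show the shape of the request.
req = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(req.full_url)
```

To actually run the query, the request can be passed to urllib.request.urlopen and the response parsed with json.load, provided the local endpoint is up.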
Format: The dataset is released as JSON dumps, where the key corrected_question contains the question, and query contains the corresponding SPARQL query.
The dataset generated has the following JSON structure, kept intact for .
{
'_id': 'Unique ID of this datapoint',
'corrected_question': 'Corrected, Final Question',
'id': 'Template ID',
'query': 'SPARQL Query',
'template': 'Template used to create SPARQL Query',
'intermediary_question': 'Automatically generated, grammatically incorrect question'
}
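A minimal sketch of reading question and query pairs from this structure follows. It assumes that a dump is a JSON array of objects shaped like the one above; the record used here is made up for illustration and is not taken from the dataset, and load_pairs is a hypothetical helper name.

```python
import json


def load_pairs(json_text):
    """Return (question, sparql) tuples from a JSON array of records."""
    records = json.loads(json_text)
    return [(r["corrected_question"], r["query"]) for r in records]


# Hypothetical record, invented for this example only.
sample = json.dumps([{
    "_id": "example-1",
    "corrected_question": "Who is the author of Example Book?",
    "id": 0,
    "query": ("SELECT DISTINCT ?uri WHERE { "
              "<http://dbpedia.org/resource/Example_Book> "
              "<http://dbpedia.org/ontology/author> ?uri }"),
    "template": "...",
    "intermediary_question": "What is the author of Example Book?",
}])

pairs = load_pairs(sample)
for question, sparql in pairs:
    print(question)
    print(sparql)
```

For a real dump, read the file contents and pass them to load_pairs in place of the sample string.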
@inproceedings{trivedi2017lc,
title={Lc-quad: A corpus for complex question answering over knowledge graphs},
author={Trivedi, Priyansh and Maheshwari, Gaurav and Dubey, Mohnish and Lehmann, Jens},
booktitle={International Semantic Web Conference},
pages={210--218},
year={2017},
organization={Springer}
}
We're in the process of automating benchmarking (and updating results on our webpage). In the meantime, please get in touch with us at priyansh.trivedi@uni-bonn.de, and we'll do it manually. Apologies for the inconvenience.
Overview
- Automatically create SPARQL queries.
- Convert SPARQL queries to intermediary NLQs.
- Manually correct intermediary NLQs to create questions.
We start with a set of seed entities and a predicate whitelist. Using the whitelist, we generate 2-hop subgraphs around the seed entities. Treating a seed entity as the supposed answer, we juxtapose SPARQL templates onto the subgraph and generate SPARQL queries.
For each SPARQL template, and based on certain conditions, we assign hand-made NL question templates to the generated SPARQL queries. Refer to this diagram to understand the nomenclature used in templates.
Finally, we follow a two-step (Correct, Review) system to generate a grammatically correct question for every template-generated one.
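The two automated steps described above (filling a SPARQL template, then pairing it with an NL question template) can be illustrated with a toy sketch. Everything here is hypothetical: the template strings, the placeholder names (e, p, e_label, p_label), the instantiate helper, and the example URIs are invented for illustration and are not the templates or code used to build the dataset.

```python
from string import Template

# Hypothetical one-triple SPARQL template and a matching NL question template.
# string.Template is used so the literal braces in SPARQL need no escaping.
sparql_template = Template("SELECT DISTINCT ?uri WHERE { <$e> <$p> ?uri }")
question_template = Template("What is the $p_label of $e_label ?")


def instantiate(entity, predicate, entity_label, predicate_label):
    """Fill both templates for one (entity, predicate) pair from a subgraph."""
    query = sparql_template.substitute(e=entity, p=predicate)
    question = question_template.substitute(
        e_label=entity_label, p_label=predicate_label
    )
    return query, question


# Made-up URIs and labels, used only to show the substitution.
query, question = instantiate(
    "http://dbpedia.org/resource/Example_Entity",
    "http://dbpedia.org/ontology/examplePredicate",
    "Example Entity",
    "example predicate",
)
print(query)
print(question)
```

The template-generated question is deliberately stilted; in the process described above, such intermediary questions are what the manual Correct and Review steps turn into the final questions.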
- Published train-test splits
- Website Updated
- Updated public website
- Dataset now available in QALD format
- Leaderboard underway
- Fixed a bug with rdf:type filter in SPARQL
- data_set.json updated
- updated templates.py
- First version released
- lc-quad.sda.tech published