Where is the LC-QuAD dataset?
DonnieZhang586 opened this issue · 3 comments
The LC-QuAD dataset has 5,000 pairs, but when I generated it from the lc-quad csv file in the data path, the result was hundreds of thousands of LC-QuAD sentence pairs. Could you please help me generate the accurate LC-QuAD dataset?
Hi @DonnieZhang586 ,
Thank you for raising the issue.
I need more information to understand the issue. Could you shed more light on what "I generated it through the lc-quad csv file in the data path" means — what is the "data path" here, and what is the "lc-quad csv"?
A script to reproduce the issue you are facing would be very helpful in this regard.
As far as "Where is the LC-QuAD dataset?" is concerned, you may find relevant information here: https://github.com/AskNowQA/LC-QuAD.
Cheers,
Anand Panchbhai
Sorry, I didn't describe the problem clearly. I want to know how the 5,000 English–SPARQL statement pairs of the LC-QuAD dataset are generated. At present, I only see JSON files. I tried to extract the English–SPARQL pairs with my own method and then reproduce the SPARQL from English via machine translation, but the result of the conversion differs by 30 BLEU points. So I want to know: how did you get the 5,000 English–SPARQL sentence pairs? Can you describe your generation process in detail?
Best wishes,
Donnie Zhang
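For reference, pulling question–SPARQL pairs out of the LC-QuAD JSON can be sketched as below. This is a minimal sketch, assuming the field names `corrected_question` and `sparql_query` from the published LC-QuAD 1.0 files — verify them against your local copy; the sample record here is a toy stand-in, not real dataset content:

```python
import json

# Toy sample in the shape of an LC-QuAD 1.0 entry. The field names
# "corrected_question" and "sparql_query" are assumptions based on the
# published v1.0 JSON; check your local files.
sample = """
[
  {
    "_id": "1501",
    "corrected_question": "Who is the mayor of Berlin?",
    "sparql_query": "SELECT DISTINCT ?uri WHERE { <http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/mayor> ?uri }"
  }
]
"""

def extract_pairs(records):
    """Yield (english, sparql) pairs, skipping malformed entries."""
    for rec in records:
        q = rec.get("corrected_question")
        s = rec.get("sparql_query")
        if q and s:
            # Collapse whitespace so each side fits on one line,
            # as parallel-corpus tools usually expect.
            yield " ".join(q.split()), " ".join(s.split())

pairs = list(extract_pairs(json.loads(sample)))
for en, sparql in pairs:
    print(en, "\t", sparql)
```

Reading the real `train-data.json`/`test-data.json` instead of the inline sample should yield exactly one pair per entry, which is one way to check you end up with ~5,000 pairs rather than hundreds of thousands.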
Hello Donnie,
I guess there are some misunderstandings here.
LC-QuAD is a benchmark dataset for QA, as is QALD.
We have created a large dataset to support Neural Question Answering over DBpedia, called DBNQA, which can be found here: https://github.com/AKSW/DBNQA.
This dataset contains QA templates extracted from both QALD and LCQUAD.
You can read more about it here: https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base.
DBNQA is known to achieve a better F-measure than LC-QuAD alone; in fact, according to Yin et al. (https://arxiv.org/abs/1906.09302), it can deliver up to 50% better F-measure.
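The template idea mentioned above can be sketched as follows: a paired (English, SPARQL) template shares a placeholder that gets filled with an entity label on the English side and the matching URI on the SPARQL side. The template, the `<A>` placeholder token, and the entity list below are made up for illustration, not taken from the DBNQA repository:

```python
# Hypothetical (English, SPARQL) template pair sharing a placeholder <A>.
templates = [
    ("what is the capital of <A>",
     "SELECT ?x WHERE { <A> <http://dbpedia.org/ontology/capital> ?x }"),
]

# Entity label / URI pairs used to fill the placeholder (illustrative only).
entities = [
    ("Germany", "<http://dbpedia.org/resource/Germany>"),
    ("France", "<http://dbpedia.org/resource/France>"),
]

def instantiate(templates, entities):
    """Fill each template with each entity: label on the English side,
    URI on the SPARQL side."""
    pairs = []
    for q_tmpl, s_tmpl in templates:
        for label, uri in entities:
            pairs.append((q_tmpl.replace("<A>", label),
                          s_tmpl.replace("<A>", uri)))
    return pairs

pairs = instantiate(templates, entities)
for en, sparql in pairs:
    print(en, "->", sparql)
```

This is why a template-driven corpus like DBNQA grows multiplicatively with the entity list, whereas LC-QuAD itself ships a fixed set of 5,000 verbalized pairs.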