UKPLab/emnlp2017-relation-extraction

Question about the dataset

heendung opened this issue · 0 comments

Hi,
I have a few questions about the dataset you constructed in the paper (EMNLP 2017).

  1. In the paper, you mentioned that the complete English Wikipedia corpus was used to construct the dataset. How many articles does this corpus contain? Does it include all articles from the English Wikipedia dump, enwiki-latest-pages-articles.xml.bz2, released at that time?

  2. Based on my understanding, you mapped the link annotations to Wikidata entities by running SPARQL queries. How many annotations did you use for the mapping? Also, how long did it take to map all of the link annotations to Wikidata entities?

  3. I am trying to map the link annotations to Wikidata entities by submitting queries to the Wikidata server via GET and POST requests (i.e., using the Wikidata Query Service, https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual). The problem is that with the Wikidata Query Service, it may take a very long time to map all the link annotations to Wikidata entities. May I know how you queried the data?
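For reference, one common way to speed up this kind of lookup is to batch many titles into a single SPARQL query with a `VALUES` clause, rather than issuing one request per link annotation. The sketch below only builds such a query string; the `schema:about`/`schema:isPartOf` pattern is the standard WDQS idiom for resolving a Wikipedia article title to its Wikidata item, but the batch size and the helper name are my own assumptions, not the method used in the paper.

```python
def build_batch_query(titles):
    """Build one SPARQL query (hypothetical helper) that resolves many
    English Wikipedia article titles to their Wikidata entity IDs."""
    # Escape embedded quotes and tag each title as an English literal.
    values = " ".join(
        '"{}"@en'.format(t.replace('"', '\\"')) for t in titles
    )
    return (
        "SELECT ?title ?item WHERE { "
        "VALUES ?title { " + values + " } "
        "?article schema:about ?item ; "
        "schema:isPartOf <https://en.wikipedia.org/> ; "
        "schema:name ?title . }"
    )

# Example: one query for several annotations at once.
query = build_batch_query(["Berlin", "Python (programming language)"])
```

The resulting query could then be sent via POST to https://query.wikidata.org/sparql with `Accept: application/sparql-results+json`, which amortizes the round-trip cost over the whole batch.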

Thanks in advance for your help.