Mismatch between knowledge source and dataset

Question

Haelles opened this issue 2 years ago · 4 comments

Dear author,
Thanks for your great work. However, I found a mismatch between the dev set and the knowledge base. For example, one of the evidence entries for hover_dev['faaec546-3cd6-4635-b7c8-dbdc17de410e'] (index: 318) is ['Project Timberwind', 4]. However, in the knowledge base, the wiki page 'Project Timberwind' ('id': '202424') has only four sentences, so there is no sentence at index 4.
Could you please have a look at the dataset?

Answer 1 · 2023-03-18T14:19:08.000Z

I think you’re referring to a known issue: the HoVer sentence splitting occasionally differs from the HotpotQA splitting of the same corpus. We stick to HotpotQA’s original splitting.

Answer 2 · 2023-03-18T15:07:41.000Z

Thanks for your reply! However, I'm still confused. I used the corpus from HotpotQA (https://nlp.stanford.edu/projects/hotpotqa/enwiki-20171001-pages-meta-current-withlinks-abstracts.tar.bz2) and still ran into this issue. What should I do with this claim?

Answer 3 · 2023-03-18T15:20:00.000Z

Yes, we also use the corpus from HotpotQA. The dev set, however, comes from HoVer. You can simply ignore the rare indices that correspond to nonexistent sentences.
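
For anyone who wants to apply this in practice, here is a minimal sketch of how such indices could be filtered out. It is not code from this repository: the function name `filter_valid_evidence`, the corpus layout (a dict mapping a page title to its list of sentences), and the 0-based sentence indexing are all assumptions made for illustration, and the toy corpus stands in for the real HotpotQA dump.

```python
from typing import Dict, List, Tuple

Evidence = Tuple[str, int]  # (page title, sentence index)


def filter_valid_evidence(
    evidence: List[Evidence],
    corpus: Dict[str, List[str]],
) -> Tuple[List[Evidence], List[Evidence]]:
    """Split evidence into (valid, dropped) entries.

    Assumes `corpus` maps a page title to its list of sentences and that
    sentence indices are 0-based. Both are assumptions for this sketch.
    """
    valid: List[Evidence] = []
    dropped: List[Evidence] = []
    for title, sent_idx in evidence:
        sentences = corpus.get(title)
        # Keep the entry only if the page exists and the index is in range.
        if sentences is not None and 0 <= sent_idx < len(sentences):
            valid.append((title, sent_idx))
        else:
            dropped.append((title, sent_idx))
    return valid, dropped


# Toy stand-in for the knowledge base: a page with four sentences.
corpus = {"Project Timberwind": ["s0", "s1", "s2", "s3"]}
evidence = [("Project Timberwind", 1), ("Project Timberwind", 4)]

valid, dropped = filter_valid_evidence(evidence, corpus)
print("kept:", valid)
print("ignored:", dropped)
```

Returning the dropped entries separately makes it easy to log how many evidence items were ignored, which should stay small if the mismatches are as rare as described.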

Answer 4 · 2023-03-22T15:28:38.000Z

Thanks for your reply!