stanford-futuredata/Baleen

Mismatch between knowledge source and dataset

Haelles opened this issue · 4 comments

Dear author,
  Thanks for your great work. However, I found a mismatch between the dev set and the knowledge base. For example, one of the evidence entries of hover_dev['faaec546-3cd6-4635-b7c8-dbdc17de410e'] (index: 318) is ['Project Timberwind', 4]. However, in the knowledge base, the wiki page 'Project Timberwind' ('id': '202424') has only four sentences, so index 4 points to a sentence that does not exist.
  Could you please have a look at the dataset?

okhat commented

I think you’re referring to a known issue. The HoVer sentence splitting differs occasionally from the HotPotQA splitting of the same corpus. We stick to HotPotQA’s original splitting.

Thanks for your reply! However, I'm still confused. I used the corpus from HotpotQA (https://nlp.stanford.edu/projects/hotpotqa/enwiki-20171001-pages-meta-current-withlinks-abstracts.tar.bz2) and still found this issue. What should I do with this claim?

okhat commented

Yes, we also use the corpus from HotPotQA. The dev set is from HoVer. You can simply ignore the rare indices that correspond to non-existing sentences.
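
A minimal sketch of that filtering, in case it helps. It is not code from this repository. It assumes each evidence entry is a `[title, sentence_index]` pair with a zero-based index, and that you have already loaded the corpus into a dict mapping page title to its list of sentences. The name `split_evidence` and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Evidence = Tuple[str, int]


def split_evidence(
    supporting_facts: Sequence[Sequence],
    sentences_by_title: Dict[str, List[str]],
) -> Tuple[List[Evidence], List[Evidence]]:
    """Separate evidence pairs that resolve to a real sentence from those that do not.

    Assumption: sentence indices are zero-based.
    """
    valid: List[Evidence] = []
    missing: List[Evidence] = []
    for title, sent_idx in supporting_facts:
        # A title absent from the corpus is treated as having no sentences.
        sentences = sentences_by_title.get(title, [])
        if 0 <= sent_idx < len(sentences):
            valid.append((title, sent_idx))
        else:
            missing.append((title, sent_idx))
    return valid, missing


# Toy corpus: placeholder strings stand in for the four real sentences.
corpus = {"Project Timberwind": ["s0", "s1", "s2", "s3"]}
facts = [["Project Timberwind", 0], ["Project Timberwind", 4]]

valid, missing = split_evidence(facts, corpus)
print("kept:", valid)
print("ignored:", missing)
```

Only the `valid` pairs would then be used as gold evidence; the `missing` list is handy for counting how many indices were dropped.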

Thanks for your reply!