mrqa/MRQA-Shared-Task-2019

Duplicate Samples in Dev Set

luomancs opened this issue · 2 comments

Hi,
I found duplicated samples in the development set of Natural Questions: the context and the question are exactly the same in two examples. For example, question 6357c3655b524feb8d0e398ff61dfabf and question 44e059927ac841d489d580a29222683b are the same! If the duplicated questions are removed, the NQ development set shrinks from 12836 to 5529 examples. Could you please check whether my finding is correct or whether I am missing something? Thank you.

Hi, I found the same problem in the dev set. But instead of 5529, I found the following numbers of unique queries, contexts, and answers: (4177, 5332, 6230)

Does anyone have any insights about that? Many thanks.
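
For anyone who wants to check counts like the ones above, here is a minimal sketch. It assumes the dev file is gzipped JSON lines with an initial header line, and that each example has a `context` string and a `qas` list whose items carry `qid`, `question`, and `answers`. Those field names, the header check, and the helper names `load_examples` and `count_unique` are assumptions for illustration, not something confirmed in this thread.

```python
import gzip
import json


def load_examples(path):
    """Read examples from a gzipped JSON-lines file (assumed layout)."""
    examples = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Assumption: the first line is a header record, not an example.
            if "header" in record:
                continue
            examples.append(record)
    return examples


def count_unique(examples):
    """Count total QA entries and unique questions, contexts, answers, and pairs."""
    questions, contexts, answers, pairs = set(), set(), set(), set()
    total = 0
    for ex in examples:
        context = ex["context"]
        contexts.add(context)
        for qa in ex["qas"]:
            total += 1
            questions.add(qa["question"])
            pairs.add((context, qa["question"]))
            answers.update(qa["answers"])
    return {
        "total": total,
        "questions": len(questions),
        "contexts": len(contexts),
        "answers": len(answers),
        "context_question_pairs": len(pairs),
    }


# Tiny made-up sample: the first two entries share context and question.
sample = [
    {"context": "c1", "qas": [{"qid": "a", "question": "q1", "answers": ["x"]}]},
    {"context": "c1", "qas": [{"qid": "b", "question": "q1", "answers": ["y"]}]},
    {"context": "c2", "qas": [{"qid": "c", "question": "q2", "answers": ["x"]}]},
]
counts = count_unique(sample)
print(counts)
```

Counting unique (context, question) pairs and counting unique questions alone can give different numbers, which may be one reason the figures reported in this thread differ.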

Hi, thanks for bringing this up! Due to a preprocessing error, an example with multiple annotations was split into multiple instances. So there are some duplicated questions, but with different ground truth answers. We'll release an updated version shortly.
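
Until the updated version is out, one possible workaround is to merge the split instances back together by pooling the answers of entries that share the same context and question. The sketch below is not the official fix; it assumes each example has a `context` string and a `qas` list with `qid`, `question`, and `answers` fields, and `merge_duplicates` is a hypothetical helper name.

```python
def merge_duplicates(examples):
    """Merge entries with identical (context, question), pooling their answers."""
    merged = {}  # (context, question) -> merged entry, in first-seen order
    for ex in examples:
        context = ex["context"]
        for qa in ex["qas"]:
            key = (context, qa["question"])
            if key not in merged:
                # Keep the qid of the first occurrence (an arbitrary choice).
                merged[key] = {
                    "context": context,
                    "qid": qa["qid"],
                    "question": qa["question"],
                    "answers": [],
                }
            for answer in qa["answers"]:
                # Avoid repeating an answer string that is already pooled.
                if answer not in merged[key]["answers"]:
                    merged[key]["answers"].append(answer)
    return list(merged.values())


# Made-up sample: two annotations of the same question, split into two entries.
split_sample = [
    {"context": "c1", "qas": [{"qid": "a", "question": "q1", "answers": ["x"]}]},
    {"context": "c1", "qas": [{"qid": "b", "question": "q1", "answers": ["y"]}]},
    {"context": "c2", "qas": [{"qid": "c", "question": "q2", "answers": ["x"]}]},
]
merged_sample = merge_duplicates(split_sample)
for entry in merged_sample:
    print(entry["qid"], entry["question"], entry["answers"])
```

Pooling answers keeps every annotation available as a valid ground truth at evaluation time, rather than scoring the same question several times against one answer each.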