Number of examples in the preprocessed dataset
DSKSD opened this issue · 1 comments
DSKSD commented
Hi, thank you for sharing great work!
I am exploring the TopiOCQA dataset to train my own retriever,
and I found that the number of examples in the preprocessed dataset data.retriever.all_history
is different from the number reported in the paper (the raw dataset).
File | Number of examples |
---|---|
topiocqa_train.json | 45450 (same as the paper) |
topiocqa_dev.json | 2514 (same as the paper) |
data.retriever.all_history/train.json | 45650 |
data.retriever.all_history/dev.json | 2525 |
I wonder what the difference between them is.
Which data should I use to train my model so that I can compare it against your baseline (DPR)?
Thanks!
Best,
vaibhavad commented
Hi @DSKSD,
Thank you for bringing this to our notice. In the previous release, we accidentally included some questions which were the last questions of the conversations that did not have any answer (the answerer annotator probably got disconnected). In the current release, we have removed such cases, so the numbers should match now.
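For anyone still holding the earlier files, the cleanup described above can be approximated locally. The counts in the issue differ by 45650 - 45450 = 200 for train and 2525 - 2514 = 11 for dev, which would correspond to the unanswered final questions. The following is only a rough sketch, not the script used for the release: the field names `conv_id`, `turn_id`, and `answer` are hypothetical and may not match the keys in the actual files, so adjust them after inspecting the data.

```python
from typing import Dict, List


def drop_unanswered_final_turns(turns: List[Dict]) -> List[Dict]:
    """Drop the last turn of each conversation if its answer is empty.

    Assumes each turn is a dict with the hypothetical keys
    "conv_id", "turn_id", and "answer".
    """
    # Record the highest turn number seen for each conversation.
    last_turn: Dict[int, int] = {}
    for turn in turns:
        cid = turn["conv_id"]
        last_turn[cid] = max(last_turn.get(cid, 0), turn["turn_id"])

    # Keep every turn except a final turn whose answer is blank.
    return [
        turn
        for turn in turns
        if not (
            turn["turn_id"] == last_turn[turn["conv_id"]]
            and not turn["answer"].strip()
        )
    ]


# Toy data standing in for a loaded JSON file.
sample = [
    {"conv_id": 1, "turn_id": 1, "question": "q1", "answer": "a1"},
    {"conv_id": 1, "turn_id": 2, "question": "q2", "answer": ""},
    {"conv_id": 2, "turn_id": 1, "question": "q3", "answer": "a3"},
]
cleaned = drop_unanswered_final_turns(sample)
removed = len(sample) - len(cleaned)
print(f"removed {removed} unanswered final turn(s)")
```

Comparing the number of removed turns with the differences above (200 and 11) is a quick way to check whether the filter matches the released files.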