McGill-NLP/topiocqa

Number of examples in the preprocessed dataset

DSKSD opened this issue · 1 comment

DSKSD commented

Hi, thank you for sharing this great work!

I am exploring the TopiOCQA dataset to train my own retriever.
I found that the number of examples in the preprocessed dataset data.retriever.all_history differs from the number reported in the paper (raw dataset).

| Name | # |
|---|---|
| topiocqa_train.json | 45450 (same as the paper) |
| topiocqa_dev.json | 2514 (same as the paper) |
| data.retriever.all_history/train.json | 45650 |
| data.retriever.all_history/dev.json | 2525 |

I wonder what the difference between them is.
Which data should I use to train my model so that I can compare it with your baseline (DPR)?

Thanks!
Best,

Hi @DSKSD,

Thank you for bringing this to our notice. In the previous release, we accidentally included some unanswered questions: each was the last question of its conversation and had no answer recorded (the answering annotator probably got disconnected). In the current release, we have removed these cases, so the numbers should match now.
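
For anyone still holding a copy of the previous release, the check described above could be reproduced locally along these lines. This is only a minimal sketch, not the script used to build the release: it assumes each file is a JSON list of records, and the field name `answers` is a hypothetical placeholder that should be replaced with whatever key the files actually use.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a JSON file whose top level is assumed to be a list of records."""
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a top-level JSON list")
    return records


def has_answer(record, answer_key="answers"):
    """Return True if the record holds a non-empty answer.

    `answer_key` is an assumed field name, not one confirmed by the dataset.
    """
    value = record.get(answer_key)
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple)):
        # A list counts as answered if at least one entry is non-blank.
        return any(str(v).strip() for v in value)
    return True


def drop_unanswered(records, answer_key="answers"):
    """Keep only the records that have an answer."""
    return [r for r in records if has_answer(r, answer_key)]
```

Calling `len(load_records(path))` before and `len(drop_unanswered(load_records(path)))` after filtering gives the two counts to compare against the numbers in the paper.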