Duplicates in movies dataset

Question

xiaoyuin opened this issue 6 years ago · 1 comment

Hello!

Firstly, thank you very much for your repository and research. This is a very interesting field. I am currently using your monument dataset as training data for my master's thesis.

I noticed you uploaded a new dataset called movies_300.zip several days ago. I intended to try it in my experiments as well, but I found that the training file contains many duplicate lines (e.g. "how long is the longest movie" appears 227 times in 'train.en').
Could you explain the reason for that? Is it appropriate to use this dataset for training, or was it made for other tasks?
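
For reference, a count like the one above can be reproduced with a short script. This is only a minimal sketch: it assumes 'train.en' is UTF-8 plain text with one question per line, and the function names and the sample lines are made up for illustration.

```python
from collections import Counter


def duplicate_counts(lines):
    """Map each line that occurs more than once to its number of occurrences."""
    counts = Counter(line.strip() for line in lines)
    return {text: n for text, n in counts.items() if n > 1}


def duplicate_counts_in_file(path):
    # Assumption: the file is UTF-8 plain text with one question per line.
    with open(path, encoding="utf-8") as handle:
        return duplicate_counts(handle)


# Tiny in-memory sample, invented for illustration only.
sample = [
    "how long is the longest movie\n",
    "who directed the movie\n",
    "how long is the longest movie\n",
]
sample_duplicates = duplicate_counts(sample)
```

Calling `duplicate_counts_in_file("train.en")` on the real file would return the repeated questions together with their counts.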

Thank you and best regards
Xiaoyu

Answer 1 · 2018-10-22T07:10:12.000Z

Hi @xiaoyuin and thanks for your interest.

You are not the only one who has pointed this out, and it should not have happened. All lines in dev and test that also appear in train should have been removed. Once that is done, you can use the dataset for training. We are going to update it soon.
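
Until the update is available, the filtering described above could be approximated locally. The following is a rough sketch, not the script used to build the dataset: it assumes the splits are lists of lines with one question per line, that overlapping lines are dropped from dev and test (rather than from train), and the helper names and sample lines are hypothetical.

```python
def indices_to_keep(target_lines, reference_lines):
    """Indices of target lines whose text does not occur in reference_lines."""
    seen = {line.strip() for line in reference_lines}
    return [i for i, line in enumerate(target_lines) if line.strip() not in seen]


def select(lines, indices):
    """Pick the lines at the given indices, preserving order."""
    return [lines[i] for i in indices]


# Invented sample data standing in for the train and dev splits.
train_en = ["how long is the longest movie", "who directed the movie"]
dev_en = ["how long is the longest movie", "which movie won the most awards"]

# Drop every dev line that also appears in train.
keep = indices_to_keep(dev_en, train_en)
dev_en_clean = select(dev_en, keep)
```

If each split also has a parallel file aligned line by line with the questions, the same `keep` indices can be passed to `select` for that file so the pairs stay in sync. Note that this only removes overlap between splits; it does not collapse repeated lines inside train itself.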