Retrieval dataset process approach
Seolen opened this issue · 1 comment
Seolen commented
Thanks for your impressive work. I have a question about evaluating video-text retrieval: in datasets such as MSVD and MSRVTT, each video comes with multiple captions. How do you handle this for retrieval?
Andy1621 commented
Yes, in the training data each video has multiple corresponding captions. During training we do not handle this specially; we simply fine-tune the models with the VTC (video-text contrastive) and VTM (video-text matching) losses.
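For illustration only, here is a minimal sketch of what leaving the multiple captions unprocessed could look like. It is not the repository's code. It assumes every (video, caption) pair is treated as an independent training sample and that VTC is a symmetric InfoNCE loss over a batch similarity matrix. The names `expand_pairs` and `vtc_loss` and the temperature of 0.07 are hypothetical choices made for this example; the VTM loss is not shown.

```python
import math

def expand_pairs(annotations):
    """Flatten {video_id: [captions]} into independent (video_id, caption) pairs.

    Assumption: no special handling, every caption becomes its own sample.
    """
    return [(vid, cap) for vid, caps in annotations.items() for cap in caps]

def vtc_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between video i and caption j in the batch;
    the matching pair for index i is assumed to sit on the diagonal.
    """
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim)]  # caption-to-video direction
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))

# Hypothetical annotations: v1 has two captions, v2 has one.
pairs = expand_pairs({"v1": ["a man sings", "a person performs"], "v2": ["a dog runs"]})
```

In this sketch, two captions of the same video that land in one batch would count as negatives for each other, since nothing is done to detect them.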
In the test data, there is only one caption per video.
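With one caption per test video, retrieval metrics reduce to checking the diagonal of the caption-video similarity matrix. The following is a rough sketch of text-to-video Recall@K under that assumption, not the evaluation script from the repository; `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(sim, k):
    """Text-to-video Recall@K.

    sim[i][j] is the similarity of caption i to video j. Assumes exactly one
    caption per video, so the ground-truth video for caption i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the ground truth = number of videos scored strictly higher.
        # Ties are resolved in favour of the ground truth (optimistic).
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)

# Toy similarity matrix for three caption-video pairs.
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
```

Here the second caption ranks its own video second, so it misses at K=1 but is counted at K=2.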