OpenGVLab/unmasked_teacher

Retrieval dataset process approach

Seolen opened this issue · 1 comment

Seolen commented

Thanks for your impressive work. I have a question about evaluating video-text retrieval: in datasets such as MSVD and MSRVTT, each video is paired with multiple captions. How do you handle this for retrieval?

Yes, in the training data there are multiple captions per video. During training we do not treat this specially; we simply fine-tune the models with the VTC (video-text contrastive) and VTM (video-text matching) losses.
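In other words, each (video, caption) pair can just be treated as an independent training sample. A minimal sketch of that flattening step (the `"video"`/`"captions"` field names here are illustrative assumptions, not the repo's actual annotation schema):

```python
# Sketch: expand multi-caption annotations into independent (video, caption)
# training pairs, so the VTC/VTM losses see each pair as its own sample.
# Field names ("video", "captions") are assumptions, not the repo's schema.

def flatten_pairs(annotations):
    """Turn [{'video': v, 'captions': [c1, c2, ...]}, ...] into a flat
    list of (video, caption) pairs, one pair per caption."""
    pairs = []
    for ann in annotations:
        for cap in ann["captions"]:
            pairs.append((ann["video"], cap))
    return pairs

annotations = [
    {"video": "vid0001.mp4", "captions": ["a dog runs", "a puppy is running"]},
    {"video": "vid0002.mp4", "captions": ["a man cooks"]},
]
pairs = flatten_pairs(annotations)
# 3 pairs total: two for vid0001.mp4, one for vid0002.mp4
```

One side effect of this approach is that two captions of the same video can land in one contrastive batch and be treated as negatives of each other, which the answer above accepts rather than correcting for.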

In the test data, there is only one caption per video, so evaluation is unaffected.
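With a one-to-one test split, standard retrieval metrics reduce to checking the diagonal of the text-video similarity matrix. A hedged sketch (assuming text `i`'s ground-truth video is index `i`, which holds when each video keeps exactly one caption):

```python
import numpy as np

def recall_at_1(sim):
    """Text-to-video R@1 from a (num_texts, num_videos) similarity matrix,
    assuming text i's ground-truth video is index i (one caption per video)."""
    preds = sim.argmax(axis=1)          # best-matching video per text query
    gt = np.arange(sim.shape[0])        # ground truth is the diagonal
    return float((preds == gt).mean())

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.3, 0.4, 0.6]])
# Every text ranks its own video first, so R@1 is 1.0 here.
```

Video-to-text R@1 is the same computation on `sim.T`.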