showlab/EgoVLP

number of frames per clip

fmu2 opened this issue · 2 comments

fmu2 commented

Thanks for the great work!

I am confused by how "num_frames" is set in video_params in the config files. If I understand correctly, the pre-trained Frozen model has num_frames=16, whereas only four frames are given as input to the model at training and inference time. In Table 4 of the paper, there are two entries for Frozen + EgoNCE with #frames equal to 4 and 16, respectively. I am wondering what the difference is here, and which one corresponds to the pre-trained model weights (EgoVLP_PT_BEST) available in the repository. May I still provide 16 frames instead of four to the provided model for feature extraction? Thank you!
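For concreteness, this is the kind of setting I mean; the dict below is purely illustrative (the key names other than num_frames/video_params are made up), not copied from the actual config files:

```python
# Purely illustrative -- not copied from the repo's config files.
config = {
    "arch": {                        # hypothetical key name
        "video_params": {
            "num_frames": 16,        # capacity of the model's temporal embedding?
        },
    },
    "data_loader": {                 # hypothetical key name
        "video_params": {
            "num_frames": 4,         # frames actually sampled per clip?
        },
    },
}
```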

Hi @fmu2 , thanks for your interest.

In the pretraining phase, the Frozen model with num_frames=16 can accept up to 16 input frames, but we only feed 4 frames due to the computation cost.
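To illustrate why that works: the temporal positional embedding is allocated for the maximum number of frames, and a shorter clip simply uses the first T positions. A minimal sketch (the class name and shapes are made up for illustration, not the actual EgoVLP code):

```python
import torch
import torch.nn as nn

class TemporalEmbed(nn.Module):
    """Sketch of a temporal positional embedding sized for a maximum
    number of frames; shorter clips use only the first T positions."""
    def __init__(self, max_frames: int = 16, dim: int = 768):
        super().__init__()
        self.temporal_embed = nn.Parameter(torch.zeros(1, max_frames, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) with num_frames <= max_frames
        t = x.shape[1]
        return x + self.temporal_embed[:, :t, :]

# Model built for up to 16 frames, pretrained with 4-frame clips.
clip = torch.randn(2, 4, 768)
out = TemporalEmbed(max_frames=16)(clip)   # only the first 4 positions are used
```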

For the downstream tasks in Tab. 4, both entries start from the same pretrained weights EgoVLP_PT_BEST (pretrained with 4 frames), and we try two variants. One is to fine-tune with num_frames=4 (same as pretraining); the other is to fine-tune with num_frames=16 (even though the temporal positions of the extra 12 frames were not learned during pretraining). The latter gets better results.

EgoVLP_PT_BEST corresponds to Frozen + EgoNCE pretrained with 4 frames.

For offline feature extraction (e.g., for NLQ), I do not recommend 16 frames, since only 4 frame positions were learned during pretraining.
But if you fine-tune on a downstream task (e.g., Charades-STA or EPIC-Kitchens), 16 frames is the better choice.
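In other words, keep num_frames consistent with how you use the model. A rough illustration (plain Python dicts; the repo's actual config format may differ):

```python
# Rough illustration; the repo's actual config format may differ.
feature_extraction = {"video_params": {"num_frames": 4}}    # matches the 4-frame pretraining
fine_tuning        = {"video_params": {"num_frames": 16}}   # e.g. Charades-STA, EPIC-Kitchens
```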

Please reach out if you have other issues.

Kevin

fmu2 commented

Thanks for your reply!