number of frames per clip
fmu2 opened this issue · 2 comments
Thanks for the great work!
I am confused by how "num_frames" is set in video_params in the config files. If I understand correctly, the pre-trained Frozen model has num_frames=16 whereas only four frames are given as input to the model at training and inference time. In Table 4 of the paper, there are two entries for Frozen+EgoNCE with #frames equal to 4 and 16, respectively. I am wondering what is the difference here, and which corresponds to the pre-trained model weights (EgoVLP_PT_BEST) available in the repository? May I still provide 16 frames instead of four to the provided model for feature extraction? Thank you!
Hi @fmu2, thanks for your interest.
In the pretraining phase, the Frozen model is configured with num_frames=16, so it can support up to 16 input frames, but we only feed num_frames=4 frames as input due to computation cost.
For the downstream tasks in Tab. 4, we start from the same pretrained weights, EgoVLP_PT_BEST (pretrained with 4 frames), and try two variants. One uses num_frames=4 for fine-tuning (same as pretraining); the other uses num_frames=16 for fine-tuning (although the temporal positions of the additional 12 frames were not learned during pretraining). The latter gets better results.
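To make the "12 frames not learned" point concrete, here is a minimal sketch. It assumes (this is not confirmed in this thread) that the model keeps a temporal position table with one slot per frame up to the configured maximum and that a clip with n frames uses the first n slots. The function names are hypothetical and are not part of the repository.

```python
MAX_FRAMES = 16  # capacity of the temporal position table (num_frames=16 in the config)


def used_temporal_slots(input_frames, max_frames=MAX_FRAMES):
    """Indices of the temporal position slots used by a clip of `input_frames` frames.

    Assumption: a clip with n frames uses the first n slots of the table.
    """
    if not 1 <= input_frames <= max_frames:
        raise ValueError(f"input_frames must be in [1, {max_frames}]")
    return list(range(input_frames))


def slots_not_seen_in_pretraining(pretrain_frames, finetune_frames, max_frames=MAX_FRAMES):
    """Slots that fine-tuning uses but pretraining never updated."""
    seen = set(used_temporal_slots(pretrain_frames, max_frames))
    return [i for i in used_temporal_slots(finetune_frames, max_frames) if i not in seen]


if __name__ == "__main__":
    fresh = slots_not_seen_in_pretraining(pretrain_frames=4, finetune_frames=16)
    print(len(fresh), fresh)
```

Under this assumption, fine-tuning with 16 frames has to learn 12 temporal positions from their initial values, while fine-tuning with 4 frames reuses only positions that were already trained.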
EgoVLP_PT_BEST corresponds to Frozen + EgoNCE, pretrained with 4 frames.
For offline feature extraction (e.g., NLQ), I do not recommend 16 frames, since the model only learned from 4 frames per clip during pretraining.
But if you want to fine-tune on a downstream task (e.g., Charades-STA, EPIC-Kitchens), 16 frames is a better choice.
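For feature extraction with 4 frames, one simple way to choose the frames of a clip is to split it into equal segments and take the middle frame of each. This is only a sketch of that idea; `sample_frame_indices` is a hypothetical helper, and the sampling used in the repository may differ.

```python
def sample_frame_indices(total_frames, num_frames=4):
    """Pick `num_frames` indices spread uniformly over a clip of `total_frames` frames.

    The clip is split into `num_frames` equal segments and the middle frame of
    each segment is taken. Indices repeat if the clip is shorter than `num_frames`.
    """
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    indices = []
    for i in range(num_frames):
        center = int((i + 0.5) * total_frames / num_frames)
        indices.append(min(center, total_frames - 1))  # stay inside the clip
    return indices


if __name__ == "__main__":
    print(sample_frame_indices(100, num_frames=4))
```

Passing `num_frames=16` gives the denser sampling that would go with 16-frame fine-tuning.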
Please reach out if you have other issues.
Kevin
Thanks for your reply!