showlab/EgoVLP

Pre-trained TimesFormer or backbone ViT model

kurshakuz opened this issue · 1 comment

Hi!

I would like to know if you have released a separate video TimesFormer encoder or backbone ViT model that can be directly used for further fine-tuning on other egocentric tasks. Is there a way to extract pre-trained features for loading into existing ViT models?

Additionally, can you please clarify which TimeSformer architecture you are using? Is it the vanilla TimeSformer, TimeSformer-HR, or TimeSformer-L?

Thanks!

Hi, kurshakuz, the video encoder weights should be loadable from the released checkpoints (which contain both the video and text encoders); I have not released a separate video TimeSformer weight.
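
For anyone wanting to reuse only the video encoder, here is a minimal sketch of extracting its weights from the full checkpoint. The file name and key prefixes (`state_dict`, `module.`, `video_model.`) are assumptions, so inspect the released checkpoint to confirm the actual layout before relying on this:

```python
import torch

# Sketch (not from the repo): pull the video-encoder weights out of the full
# EgoVLP checkpoint so they can be loaded into a standalone TimeSformer/ViT.
ckpt = torch.load("egovlp.pth", map_location="cpu")   # hypothetical filename
state_dict = ckpt.get("state_dict", ckpt)              # some checkpoints nest under 'state_dict'

video_weights = {}
for k, v in state_dict.items():
    k = k.replace("module.", "", 1)                    # strip DataParallel prefix if present
    if k.startswith("video_model."):                   # keep only video-encoder parameters (assumed prefix)
        video_weights[k[len("video_model."):]] = v

# Load into your own TimeSformer instance; strict=False tolerates missing/extra heads.
# my_timesformer.load_state_dict(video_weights, strict=False)
torch.save(video_weights, "egovlp_video_encoder.pth")
```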

For the second question, I use TimeSformer-Base (the vanilla variant). If you want to try other architectures, please refer to LaViLa, which provides good ablation studies regarding model scale. Hope it helps!