microsoft/VideoX

Questions on CLIP finetuning

hanoonaR opened this issue · 4 comments

Hi Authors,

Thank you for sharing your great work.

I have a few questions about the CLIP fine-tuning setup used in 1) the few-shot experiments (Table 5) and 2) the zero-shot experiments (Tables 3 and 4). Your ablation on the choice of CLIP fine-tuning in Table 7 uses different settings (8 frames vs. 32 or 16) from the original tables, so it is hard to conclude whether both the CLIP image and CLIP text encoders are fine-tuned for the few-shot and zero-shot numbers reported in the main tables.

In your codebase, the CLIP text encoder is frozen by default. Could you please clarify which setting was used?

Thank you.

nbl97 commented

Thanks for your interest, and sorry for the confusion.
We freeze the text encoder and fine-tune the visual encoder in all settings, including few-shot, zero-shot, and many-shot.
Hope this helps.
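
For readers following along, here is a minimal PyTorch sketch of what "freeze the text encoder, fine-tune the visual encoder" typically looks like. It is an illustration only, not this repository's actual model or training code: the `TinyCLIP` class, its `visual` and `text` attributes, the layer sizes, and the learning rate are all hypothetical stand-ins.

```python
import torch
from torch import nn


class TinyCLIP(nn.Module):
    """Hypothetical stand-in for a CLIP-style model with two encoders."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.visual = nn.Linear(32, dim)    # placeholder visual encoder
        self.text = nn.Embedding(100, dim)  # placeholder text encoder


def freeze_text_encoder(model: TinyCLIP) -> list:
    """Disable gradients for the text branch and return the trainable params."""
    for p in model.text.parameters():
        p.requires_grad_(False)
    return [p for p in model.parameters() if p.requires_grad]


model = TinyCLIP()
trainable = freeze_text_encoder(model)

# Only the visual-branch parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # lr is an arbitrary example value

# One dummy step: gradients reach the visual encoder but not the text encoder.
video_feat = model.visual(torch.randn(4, 32))
text_feat = model.text(torch.tensor([1, 2, 3, 4]))
loss = -(video_feat * text_feat).sum()
loss.backward()
optimizer.step()
```

After `loss.backward()`, the text embedding weight has no gradient while the visual weights do, so fewer gradients and optimizer states need to be stored than when both encoders are fine-tuned.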

Hi,

Just to verify, does this mean that the reported numbers for the fully supervised setting were also obtained with a frozen text encoder?

Thanks!

nbl97 commented

Yes, the text encoder is always frozen. Sorry for the confusion again.

Hi Bolin,
Thank you for sharing the nice work and implementation!
You showed in Table 7 that fine-tuning both the visual and text encoders leads to the best performance, at the cost of higher CUDA memory consumption.
Do you also have results for all settings (fully supervised closed-set, zero-shot, few-shot) when fine-tuning both encoders? I would expect them to be higher than the numbers reported in Tables 1-5.
Looking forward to your reply, and thank you in advance!