Issues about Freezing some additional layers instead of meanP in CLIP4Clip
celestialxevermore opened this issue · 2 comments
Dear Author, I have been deeply fascinated by your multimodal studies these days.
As a follow-up to ArrowLuo/CLIP4Clip#42, I have some questions of my own.
I really appreciate your kind teaching and advice in advance.
I want to apply the cross module to obtain fine-grained, cross-modal representations, as you did in UniVL,
but a few questions came to mind first.
As you mentioned in the issue linked above,
the transformer in the cross module starts from randomly initialized weights, so it cannot outperform simply setting the similarity policy to meanP.
Here are my questions:
- Even though the cross module has limitations compared with meanP, as you mentioned,
is there any special reason you chose the cross module rather than meanP in UniVL?
- Is it possible to do masked modeling in CLIP4Clip, as you did in UniVL?
- Unlike CLIP4Clip,
Line 224 in 0a7c07f
Can you explain the exact meaning of 'cross_output' and 'concat_mask', and of the objects that follow: sequence_cross_output and visual_cross_output?
I guess that sequence_cross_output and visual_cross_output are more multimodally engaged than the offline representations (sequence_output, visual_output), but I would like to confirm that.
I am really enthusiastic about your studies, and thank you for your contributions to the multimodal field.
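To make the question concrete, here is a minimal sketch of how I understand these objects, for a single sample and in plain Python lists. It is an assumption on my side, not the UniVL code: `toy_cross_encoder` is a hypothetical stand-in for the cross transformer, and `cross_outputs` is a name made up for illustration. The idea is that text and video features are concatenated along the sequence axis, the two masks are concatenated the same way, and the joint output is split back into a text part and a video part.

```python
from typing import List, Tuple

Vector = List[float]


def toy_cross_encoder(tokens: List[Vector], mask: List[int]) -> List[Vector]:
    """Hypothetical stand-in for the cross transformer (NOT the real module).

    Each position gets the mean of all unmasked positions added to it,
    so every output depends on both text and video inputs.
    """
    dim = len(tokens[0])
    valid = [t for t, m in zip(tokens, mask) if m == 1]
    mean = [sum(t[d] for t in valid) / len(valid) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in tokens]


def cross_outputs(
    sequence_output: List[Vector],
    visual_output: List[Vector],
    attention_mask: List[int],
    video_mask: List[int],
) -> Tuple[List[Vector], List[int], List[Vector], List[Vector]]:
    # Concatenate text tokens and video frames into one joint sequence.
    concat_features = sequence_output + visual_output
    # The joint mask marks which positions (text or video) are real, not padding.
    concat_mask = attention_mask + video_mask
    # Joint encoding over both modalities.
    cross_output = toy_cross_encoder(concat_features, concat_mask)
    # Split the joint output back into its text part and its video part.
    n_text = len(attention_mask)
    sequence_cross_output = cross_output[:n_text]
    visual_cross_output = cross_output[n_text:]
    return cross_output, concat_mask, sequence_cross_output, visual_cross_output
```

Under this reading, sequence_output and visual_output are computed by each encoder alone, while sequence_cross_output and visual_cross_output are the same positions after they have seen the other modality.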
Sincerely,
Hi @celestialxevermore, sorry for my delayed reply, and thanks for your interest. I had some personal matters to attend to before I saw your issue. I will try my best to answer the questions.
- UniVL is not designed only for the retrieval task. Moreover, its first-stage pretraining is the same as meanP in CLIP4Clip.
- Yes, but it will cost more GPU resources.
- Yes, sequence_cross_output and visual_cross_output are more multimodally engaged, as you said.
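For readers comparing the two similarity policies, the parameter-free meanP computation can be sketched roughly as follows. This is a simplified illustration under assumptions, not the repository code: it handles one text and one video in plain Python, the helper names are hypothetical, and it leaves out the learned logit scale applied to the similarity during training.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(frames: List[Vector], video_mask: List[int]) -> Vector:
    """Average the frame features, ignoring padded frames (mask == 0)."""
    dim = len(frames[0])
    n_valid = sum(video_mask)
    if n_valid == 0:
        return [0.0] * dim
    return [
        sum(f[d] * m for f, m in zip(frames, video_mask)) / n_valid
        for d in range(dim)
    ]


def meanp_similarity(
    text_feature: Vector, frames: List[Vector], video_mask: List[int]
) -> float:
    """Cosine similarity between a text vector and the mean-pooled video vector."""
    frames = [l2_normalize(f) for f in frames]
    video = l2_normalize(mean_pool(frames, video_mask))
    text = l2_normalize(text_feature)
    return sum(t * v for t, v in zip(text, video))
```

Because nothing in this path has trainable weights of its own, it can rely entirely on the pretrained encoders, whereas a cross transformer has to be learned from random initialization.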
Best~
Haha, not at all. I am always thankful for your kindness and was not bothered by the delay. Thank you for your help.