muzairkhattak/multimodal-prompt-learning

Tensor shape of textual and visual prompt

Closed this issue · 7 comments

Hi, thank you for your excellent work.

During the process of running your project, I have a question.
I noticed that the dimensions of the text features are [num_classes, transformer.width], while the dimensions of the image features are [batch_size, transformer.width].
Is this correct? I thought that the dimensions of both features should be [num_classes, transformer.width]. Could you explain briefly how you designed the visual prompt?

Thank you.

Hi @xiapeng1110,

Thank you for showing interest in MaPLe!

Yes, you are correct: we get text features of size [num_classes, transformer.width] from the text encoder. Here num_classes acts as the batch size for the text encoder. Similarly, the visual features have size [batch_size, transformer.width]. We get the final predictions per image sample by multiplying the image features with the transpose of the text features, so the final output has size [batch_size, num_classes]. It can be interpreted as a set of prediction scores, one per class, for each image.
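To make the shapes concrete, here is a minimal sketch of that matrix multiplication. It uses NumPy with random arrays in place of the real encoder outputs. The sizes, the L2 normalization step, and the `logit_scale` value are illustrative assumptions, not values taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the repository's configuration).
batch_size, num_classes, width = 4, 10, 512

# Random stand-ins for the encoder outputs (not real CLIP features).
image_features = rng.standard_normal((batch_size, width))   # [batch_size, width]
text_features = rng.standard_normal((num_classes, width))   # [num_classes, width]

# L2-normalize each row so the dot product becomes a cosine similarity
# (assumed here, as is common for CLIP-style models).
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Assumed temperature; the actual model learns its own scale.
logit_scale = 100.0

# [batch_size, width] @ [width, num_classes] -> [batch_size, num_classes]
logits = logit_scale * image_features @ text_features.T

# One predicted class index per image.
predictions = logits.argmax(axis=1)

print(logits.shape)       # (4, 10)
print(predictions.shape)  # (4,)
```

Each row of `logits` holds the scores of one image against every class prompt, which is why the text side needs one feature vector per class rather than one per image.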

Regarding the visual prompt creation, we pass the textual prompt vectors as an input to the coupling function (implemented as a linear layer) to obtain the visual prompts as outputs. This process is shown in this part of the code.

These output visual prompts are then concatenated with the input image tokens during both training and inference.
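As a rough illustration of the coupling step, the sketch below projects a few textual prompt vectors into the vision embedding space with a single linear layer written in NumPy. The names `n_ctx`, `text_width`, `vision_width`, and `coupling_function`, as well as all sizes and the random initialization, are hypothetical and chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_ctx = 2           # number of learnable prompt vectors
text_width = 512    # width of the text transformer
vision_width = 768  # width of the vision transformer

# Learnable textual prompt vectors: one row per prompt token.
text_prompts = rng.standard_normal((n_ctx, text_width)) * 0.02

# Parameters of the linear layer y = x @ W.T + b (randomly initialized here).
weight = rng.standard_normal((vision_width, text_width)) * 0.02
bias = np.zeros(vision_width)


def coupling_function(prompts):
    """Map textual prompts [n_ctx, text_width] to visual prompts [n_ctx, vision_width]."""
    return prompts @ weight.T + bias


visual_prompts = coupling_function(text_prompts)
print(visual_prompts.shape)  # (2, 768)
```

Note that in this sketch the visual prompts have no batch dimension: the same few vectors are derived from the textual prompts and then shared by every image.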

I hope that is clear now. Please feel free to ask in case you have any further questions.

Thank you very much!

Hi, thanks for your reply.

So, the visual prompt is simply the vectors obtained from the textual prompt through the coupling function, with dimensions [batch_size, transformer.width], and then these vectors are concatenated with the tokens from the input image (could you please let me know where the specific implementation code for this part is located?). In this case, how should we understand the role of the visual prompt?

Thank you very much.

Hi @xiapeng1110,

Yes, the visual prompts are concatenated with the tokens from the input image. Please refer to these lines and these lines to see the concatenation process in our code.

The role of the visual prompts is to steer the image-side representations, which is indirectly similar to fine-tuning the image encoder. Conceptually, you can think of these prompts as image-enhancing context vectors: when they are appended to the normal image tokens, they attempt to enhance each image patch token, which eventually helps the final classification.
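A small sketch of the concatenation may help with the shapes. It assumes a batch-first token layout `[batch_size, num_tokens, vision_width]` and made-up sizes; the repository's own tensor layout and sizes may differ, so treat this purely as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes and batch-first layout, for illustration only.
batch_size, num_tokens, vision_width, n_ctx = 4, 197, 768, 2

# Random stand-ins for the image tokens and for the visual prompts.
image_tokens = rng.standard_normal((batch_size, num_tokens, vision_width))
visual_prompts = rng.standard_normal((n_ctx, vision_width))

# Share the same prompt vectors across every image in the batch.
prompts_batched = np.broadcast_to(
    visual_prompts, (batch_size, n_ctx, vision_width)
)

# Append the prompts along the token axis, after the image tokens.
tokens_with_prompts = np.concatenate([image_tokens, prompts_batched], axis=1)

print(tokens_with_prompts.shape)  # (4, 199, 768)
```

The sequence simply grows by `n_ctx` tokens per image, so the attention layers of the image encoder can let every patch token attend to the prompt tokens.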

Thank you!

Thanks a lot.

I understand your explanation of the code implementation. However, in this section of your code, you have defined design_details["vision_depth"] = 0, which means VPT_shallow = False. Does this imply that MaPLe does not utilize the visual prompt?

No problem.

Regarding your query, the variables that are set to 0 in these lines refer to the design choices of another baseline method, Independent Vision-Language Prompting (IVLP), so they do not affect the design of MaPLe. MaPLe does use visual prompts.

Got it!

It helps me a lot. Thanks very much!

No problem!
I'm glad we could help you out.

Thank you!