bronyayang/Law_of_Vision_Representation_in_MLLMs

Question about A score

waltonfuture opened this issue · 1 comment

In the paper, it’s written that the A score is calculated as the maximum cosine similarity between vector pairs from the CLIP embedding and the target vision representation embedding. How can this indicate cross-modal alignment, given that it does not use any textual information from the LLM? Looking forward to further explanation. Thanks!
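For concreteness, here is a minimal sketch of how I read that computation. It is not taken from the repo: the function name `a_score`, the per-token max followed by a mean, and the assumption that both embeddings already share the same feature dimension are my own guesses at the details.

```python
import numpy as np


def a_score(clip_emb: np.ndarray, target_emb: np.ndarray) -> float:
    """Hypothetical A-score sketch for a single image.

    clip_emb:   (n, d) array of CLIP token embeddings
    target_emb: (m, d) array of target vision representation embeddings
    Assumes both are already in the same d-dimensional space.
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    clip_n = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    tgt_n = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    # Pairwise cosine similarity between every CLIP and target vector: (n, m).
    sim = clip_n @ tgt_n.T

    # Assumption: take the best-matching target vector for each CLIP vector,
    # then average those maxima over the CLIP vectors.
    return float(sim.max(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(4, 8))
    target = rng.normal(size=(6, 8))
    print(a_score(clip, target))
```

Note that only image-side embeddings appear anywhere in this computation, which is exactly why I am asking how it reflects alignment with the text modality.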

Hi,

I think this issue can answer your question. Thanks!