Question about A score

Question

waltonfuture opened this issue 3 months ago · 1 comment

The paper says that the A score is calculated as the maximum cosine similarity between vector pairs from the CLIP embedding and the target vision representation embedding. How can this indicate cross-modal alignment, given that no textual information from the LLM is used? Looking forward to further explanation. Thanks!
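
For concreteness, here is a rough sketch of the computation as I understand it from that description. It is not the authors' code: the function name `max_pairwise_cosine`, the `(num_vectors, dim)` array shapes, and the assumption that both embeddings already share the same dimension are all mine, and the paper may aggregate over images differently.

```python
import numpy as np

def max_pairwise_cosine(clip_emb, target_emb, eps=1e-8):
    """Maximum cosine similarity over all (CLIP vector, target vector) pairs.

    Hypothetical helper: both inputs are assumed to be arrays of shape
    (num_vectors, dim) with the same dim.
    """
    a = np.asarray(clip_emb, dtype=np.float64)
    b = np.asarray(target_emb, dtype=np.float64)
    # L2-normalize each row so that a dot product equals cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    # sim[i, j] is the cosine similarity between a[i] and b[j].
    sim = a @ b.T
    return float(sim.max())

# Toy inputs standing in for real embeddings.
clip_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
target_emb = np.array([[1.0, 1.0], [-1.0, 0.0]])
score = max_pairwise_cosine(clip_emb, target_emb)
```

Only vision embeddings appear in this computation, which is what prompts the question.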

Answer 1 · 2024-10-24T21:42:53.000Z

Hi,

I think this issue can answer your question. Thanks!