Question about A score
waltonfuture opened this issue · 1 comment
waltonfuture commented
In the paper, it's written that the A score is calculated as the maximum cosine similarity between vector pairs from the CLIP embedding and the target vision representation embedding. How can this indicate cross-modal alignment, given that no textual information from the LLM is used? Looking forward to further explanation. Thanks!
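For reference, the computation described in the question can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `max_cosine_similarity`, the toy arrays, the assumption that both embeddings share the same dimension, and the final averaging step are all hypothetical choices made for the example.

```python
import numpy as np

def max_cosine_similarity(clip_emb, target_emb, eps=1e-12):
    """For each CLIP vector, return its highest cosine similarity
    to any vector in the target vision representation.

    Assumes both inputs are 2-D arrays with the same feature dimension.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    a = clip_emb / np.maximum(np.linalg.norm(clip_emb, axis=1, keepdims=True), eps)
    b = target_emb / np.maximum(np.linalg.norm(target_emb, axis=1, keepdims=True), eps)
    sim = a @ b.T  # shape (n_clip, n_target): all pairwise cosine similarities
    return sim.max(axis=1)

# Toy embeddings (made up for illustration only).
clip_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
target_emb = np.array([[3.0, 0.0], [1.0, 1.0]])

per_vector_max = max_cosine_similarity(clip_emb, target_emb)
# Averaging the per-vector maxima is an assumed aggregation, not taken from the paper.
score = float(per_vector_max.mean())
```

Note that nothing in this sketch touches text tokens or the LLM; only the two sets of vision embeddings are compared, which is the point the question raises.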
bronyayang commented
Hi,
I think this issue can answer your question. Thanks!