EvolvingLMMs-Lab/LongVA

Questions about CLIP-ViT-L/14@336px


Wgkai commented

Thank you for your amazing work. I searched online and found that the CLIP-ViT-L/14@336px model divides an image into 14*14 = 196 patches, and that the embedding dimension is 768. In your work, the shape of the features after the CLIP visual encoder is (576, 1024). Where does that come from?

It is (336 / 14) ** 2 = 24 ** 2 = 576 patches.
The 14 in ViT-L/14 refers to the patch size in pixels, not the number of patches per dimension, so a 336x336 input yields a 24x24 grid of 576 patches.
As for the feature dimension: 1024 is the hidden width of the ViT-L transformer. The 768 you found is the dimension of CLIP's final projected image embedding; LLaVA-style models take the per-patch hidden states before that projection, which gives the (576, 1024) shape.
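
For a concrete check, here is a minimal sketch using the Hugging Face `transformers` CLIP implementation with the `openai/clip-vit-large-patch14-336` checkpoint. It only illustrates the shapes; it is not necessarily how LongVA loads its vision encoder:

```python
import torch
from transformers import CLIPVisionModel

# CLIP-ViT-L/14 at 336px resolution.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# A dummy 336x336 RGB image batch.
pixel_values = torch.randn(1, 3, 336, 336)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size).
# The extra token at index 0 is the [CLS] token, which LLaVA-style
# models drop before feeding patches to the language model.
patch_features = outputs.last_hidden_state[:, 1:, :]
print(patch_features.shape)  # torch.Size([1, 576, 1024])

# 576 = (336 // 14) ** 2 patches; 1024 is ViT-L's hidden width.
# The 768-dim vector you saw online is the pooled, projected CLIP image
# embedding (see CLIPVisionModelWithProjection), not the per-patch features.
```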