EvolvingLMMs-Lab/LongVA

Questions about CLIP-ViT-L/14@336px


Wgkai commented

Thank you for your amazing work. I searched online and found that the CLIP-ViT-L/14@336px model divides an image into 14*14 = 196 patches, and that the embedding dimension is 768. In your work, the shape of the features after the CLIP visual encoder is (576, 1024). Where does that come from?

It is (336 / 14) ** 2 = 24 ** 2 = 576 patches.
The 14 in ViT-L/14 refers to the patch size in pixels, not the number of patches per dimension, so a 336x336 input yields a 24x24 grid of 576 patches.
As for the feature dimension: 1024 is the hidden width of the ViT-L transformer. The 768 you found is the dimension of CLIP's final projected image embedding; LLaVA-style models take the per-patch hidden states before that projection, which gives the (576, 1024) shape.
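
For a concrete check, here is a minimal sketch using the Hugging Face `transformers` CLIP implementation with the `openai/clip-vit-large-patch14-336` checkpoint. It only illustrates the shapes; it is not necessarily how LongVA loads its vision encoder:

```python
import torch
from transformers import CLIPVisionModel

# CLIP-ViT-L/14 at 336px resolution.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# A dummy 336x336 RGB image batch.
pixel_values = torch.randn(1, 3, 336, 336)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size).
# The extra token at index 0 is the [CLS] token, which LLaVA-style
# models drop before feeding patches to the language model.
patch_features = outputs.last_hidden_state[:, 1:, :]
print(patch_features.shape)  # torch.Size([1, 576, 1024])

# 576 = (336 // 14) ** 2 patches; 1024 is ViT-L's hidden width.
# The 768-dim vector you saw online is the pooled, projected CLIP image
# embedding (see CLIPVisionModelWithProjection), not the per-patch features.
```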