Questions about CLIP-ViT-L/14@336px
Opened this issue · 1 comment
Wgkai commented
Thank you for your amazing work. I searched online and found that the CLIP-ViT-L/14@336px model divides an image into 14*14=196 patches, with an embedding dimension of 768. In your work, however, the features coming out of the CLIP visual encoder have shape (576, 1024). Where does that shape come from?
jzhang38 commented
It is (336/14) ** 2 = 576 patches.
The 14 in the model name is the patch size in pixels (each patch is 14x14), not the number of patches along each dimension.
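To make the arithmetic concrete, here is a minimal sketch of the patch count calculation. The helper name `num_patches` is made up for illustration and is not part of this repo or of CLIP; it only assumes a square image split into non-overlapping square patches.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image.

    Hypothetical helper for illustration only (not a CLIP API).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # patches along one dimension
    return per_side * per_side


print(num_patches(336, 14))  # 24 * 24 = 576
print(num_patches(224, 14))  # 16 * 16 = 256
```

As for the second dimension, which the reply above does not cover: 1024 matches the hidden width of the ViT-L transformer, while 768 is the size after CLIP's final projection into the joint image-text space. So a shape of (576, 1024) would be consistent with taking per-patch hidden states before that projection, though that is an assumption about where the features are read from.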