whwu95/BIKE

The parameters of BIKE are smaller than the original CLIP ViT-L/14

leexinhao opened this issue · 0 comments

The reason why the parameters of BIKE are smaller than the original CLIP ViT-L/14 is that in the BIKE model, we only utilize the vision encoder from CLIP and do not include the parameters of CLIP's text encoder.

Originally posted by @whwu95 in #3 (comment)

In fact, the visual encoder alone has 303M parameters for ViT-L/16, and that count already excludes the text encoder.
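
For reference, the 303M figure can be reproduced by hand. The snippet below is only a rough sketch: it assumes the standard ViT-L hyperparameters (width 1024, 24 layers, MLP size 4096, 224x224 input, 16x16 patches) and counts only the encoder, with no classification head and no text encoder. The function name `vit_encoder_params` is made up for this illustration and is not part of BIKE or CLIP; the CLIP visual encoder differs slightly in detail (for example an extra projection layer), so its exact count will not match to the last digit.

```python
def vit_encoder_params(width=1024, layers=24, mlp_dim=4096,
                       patch=16, image=224, channels=3):
    """Estimate the parameter count of a plain ViT encoder (no head).

    Assumes the standard ViT layout: biased linear layers, two LayerNorms
    per block, a class token, and learned position embeddings.
    """
    # Self-attention: q, k, v and output projections, each width x width + bias.
    attn = 4 * (width * width + width)
    # Feed-forward MLP: width -> mlp_dim -> width, both layers with bias.
    mlp = width * mlp_dim + mlp_dim + mlp_dim * width + width
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * width
    block = attn + mlp + norms

    # Patch embedding (a conv with bias), class token, position embeddings.
    tokens = (image // patch) ** 2 + 1
    patch_embed = channels * patch * patch * width + width
    embeddings = patch_embed + width + tokens * width

    # Final LayerNorm after the last block.
    final_norm = 2 * width
    return layers * block + embeddings + final_norm


if __name__ == "__main__":
    n = vit_encoder_params()
    print(f"{n:,} parameters (~{n / 1e6:.1f}M)")
    # prints "303,301,632 parameters (~303.3M)"
```

Under these assumptions the vision encoder on its own already lands at roughly 303M, so leaving out the text encoder does not by itself explain a count below that.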