openai/CLIP

Bigger models release?

rom1504 opened this issue · 19 comments

Hi,
Thanks for these amazing results and for releasing the code and ViT-B/32 weights!
Do you plan to also release the 3 bigger models you mention in the paper?

Hi! We will be releasing the RN50 model soon, but we haven't decided when/whether we will release the other models. I hope to tell you good news in the near future!

Hope to see the visual transformer models included as well!

ekCSU commented

It really helps the research community (especially those with lower budgets) to be able to try out state-of-the-art ML. CLIP is a simple and elegant idea that many applications and research projects can benefit from. But the smallest released model is just not performing very well. We would greatly appreciate it if OpenAI released the larger models. Thanks.

Agreed with @ekCSU! I have a bunch of things I'd like to test, namely the power of zero-shot inference... Will the larger models do better with questions like these?

https://twitter.com/metasemantic/status/1348113145609465856
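For reference, this is the basic zero-shot query pattern the released checkpoints support; once a bigger model is out, only the model name needs to change. A minimal sketch with the public ViT-B/32 weights; the image path and candidate captions are purely illustrative:

```python
# Minimal zero-shot classification sketch with the released ViT-B/32 checkpoint.
# The image path and candidate captions below are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("query.png")).unsqueeze(0).to(device)
texts = clip.tokenize([
    "a photo of a dog riding a skateboard",
    "a photo of a dog",
    "a photo of a skateboard",
]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # highest probability = best-matching caption
```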

Another upvote. Contrary to what ekCSU said above, the ViT-B/32 has already been impressive in terms of its ability to generalise to weird domains. Personally I am particularly interested in a pretrained hybrid model, which uses a convolutional backbone but with a transformer rather than a maxpool afterwards. What the paper shows, in my perception, is that you 'can' train SOTA models using transformers alone if you have the compute and data, but not that it's necessarily the most efficient or natural choice. It seems to me that the tiling boundaries in a pure vision transformer must lead to funny/suboptimal behavior at some level. Curious to see how those intuitions play out in my problem domain.

It sounds to me like: https://github.com/CompVis/taming-transformers

The original 'AN IMAGE IS WORTH 16X16 WORDS' paper investigates them under the label of 'hybrid' models; and if you look at fig. 5, you can see they offer the best performance/training-FLOPS tradeoff. The pure transformer reaches the highest absolute performance, but also with a much bigger compute budget. The experiments I've seen reported give no indication that a hybrid shouldn't be able to keep up if given the bigger compute budget as well. For me the takeaway of the ViT paper isn't 'let's do away with convolutions completely'. Yeah, they demonstrate that you can, at least for purposes of classification, which is cool and all from a theoretical POV; but not that you should.

For most applications, and indeed for generative purposes, a hybrid transformer-convolution architecture seems much more sensible to me. Attention is great and all, but that image tiling mechanism just seems completely unnatural; and one might as well get the benefit of attentive reasoning on the higher-level feature maps. Conv filters do a fine job of detecting edges and assembling them into higher-level features. Transformers would be a great tool on top of that to check whether a cat's ears are actually sitting on top of its head and all that. At least that's based on theoretical reasoning and what little direct comparison I've seen, so I could be wrong; but that's why I'd love to try for myself.
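To make the "convolutional backbone plus transformer" idea concrete, here is a rough sketch in the spirit of ViT's hybrid variant, not CLIP's actual architecture: a ResNet feature map is flattened into tokens and fed to a small transformer encoder. The backbone choice, dimensions, and token layout are all illustrative assumptions.

```python
# Rough sketch of the "hybrid" idea discussed above: a convolutional backbone
# produces a feature map, which is flattened into tokens for a transformer
# encoder instead of being pooled away. The torchvision backbone and all
# dimensions are illustrative; this is not CLIP's actual architecture.
import torch
import torch.nn as nn
import torchvision

class HybridEncoder(nn.Module):
    def __init__(self, embed_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep the spatial map
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)         # channels -> token dim
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 7 * 7, embed_dim))  # 224px input -> 7x7 map
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):
        feats = self.proj(self.backbone(x))            # (B, D, H, W)
        tokens = feats.flatten(2).transpose(1, 2)      # (B, H*W, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.transformer(tokens)[:, 0]          # CLS token as the image embedding

# emb = HybridEncoder()(torch.randn(2, 3, 224, 224))   # -> (2, 512)
```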

The BoTNet paper is also trash-talking pure transformers pretty hard, though I think they mostly demonstrate the same point: that without CLIP-like training, pure transformers lack the data-efficiency to be trained well. Sadly they do not seem to directly address the question of how a hybrid transformer as per the ViT paper actually compares to their shuffling around of the transformer and 1x1/dense layers, as they schematically contrast in fig. 3. Given how obvious a comparison that is, we can safely assume the implication is that the hybrid ViT turns out to be (marginally) superior.
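For contrast with the hybrid sketch above, the BoTNet-style rearrangement being discussed roughly amounts to swapping the 3x3 spatial convolution inside a ResNet bottleneck for global self-attention. A very loose sketch; the relative position encodings and other details of the real paper are omitted, and all names and dimensions are illustrative:

```python
# Loose sketch of a BoTNet-style block: a ResNet bottleneck where the 3x3
# spatial convolution is replaced by multi-head self-attention over spatial
# positions. Relative position encodings from the actual paper are omitted.
import torch
import torch.nn as nn

class BottleneckSelfAttention(nn.Module):
    def __init__(self, in_ch=2048, bottleneck_ch=512, num_heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(bottleneck_ch, num_heads, batch_first=True)
        self.expand = nn.Conv2d(bottleneck_ch, in_ch, kernel_size=1)

    def forward(self, x):
        h = self.reduce(x)                              # (B, C, H, W)
        b, c, ht, wd = h.shape
        seq = h.flatten(2).transpose(1, 2)              # (B, H*W, C)
        attn_out, _ = self.attn(seq, seq, seq)          # global self-attention
        h = attn_out.transpose(1, 2).reshape(b, c, ht, wd)
        return x + self.expand(h)                       # residual connection
```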


Any updates on this? Are you able to share anything about what is factoring into the decision? I would love to try out the larger models.

Thank you for releasing ViT-B/16; it seems to be very interesting for guiding image generative models: more coherent, with finer detail and more defined shapes. I would LOVE to try it with ViT-H/14 though.
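For context, "guiding image generative models" with CLIP usually means scoring a candidate image against a text prompt and steering the generator with that similarity. A minimal sketch of such a guidance loss; the generator is left abstract and the prompt is purely illustrative:

```python
# Sketch of CLIP guidance as used in VQGAN+CLIP-style pipelines: the similarity
# between a candidate image and a text prompt is used as a differentiable score
# to steer a generator. The generator itself is left abstract here; `generated`
# stands in for its differentiable output, already resized/normalized for CLIP.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
text = model.encode_text(clip.tokenize(["a watercolor painting of a lighthouse"]).to(device))
text = F.normalize(text, dim=-1)

def clip_guidance_loss(generated):
    """generated: (B, 3, 224, 224) tensor in CLIP's normalized input space."""
    img = F.normalize(model.encode_image(generated), dim=-1)
    return -(img @ text.T).mean()  # maximize image-text cosine similarity
```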

Is there any advice on how to modify one of the ImageNet-pretrained ViTs to work as a placeholder until it's officially released? Not sure if that's possible, but if it is I would love to know how. I've tried my best, but the naming and folder structure of the pretrained ViT from Google and of the CLIP model are just too different to get anywhere fast. I can wait for the official CLIP version though.

A similar approach would be to learn a layer or two on top of one of the official ViT models (or any other vision model) to align with CLIP's feature space. No idea how well it would work, or whether it would work better, though. It should be easier and more flexible to directly use one of the PyTorch Image Models than to retrofit other models into CLIP's existing VisionTransformer class, since tiny implementation details may differ.
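A minimal sketch of that idea, assuming a frozen ViT from the timm (PyTorch Image Models) library and a single learned linear projection trained to match CLIP's image embeddings; the model name, loss, and training step are illustrative assumptions, not a recipe from this thread:

```python
# Sketch of the suggestion above: keep an ImageNet-pretrained ViT frozen (here
# via the timm library) and learn a small projection head so its features land
# in CLIP's image-embedding space. Model name, loss, and training step are
# illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).to(device).eval()

proj = nn.Linear(backbone.num_features, clip_model.visual.output_dim).to(device)  # 512 for ViT-B/32
opt = torch.optim.Adam(proj.parameters(), lr=1e-4)

def alignment_step(images_timm, images_clip):
    """Same image batch, preprocessed once for timm and once for CLIP."""
    with torch.no_grad():
        target = F.normalize(clip_model.encode_image(images_clip).float(), dim=-1)
        feats = backbone(images_timm)                   # frozen backbone features
    pred = F.normalize(proj(feats), dim=-1)
    loss = (1 - (pred * target).sum(dim=-1)).mean()     # cosine-distance loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```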

FYI to those watching: it looks like 2 bigger models were released in a recent commit. Thank you @jongwook!

For example, if it looks like a water drop is falling towards a surface, it goes ahead and makes it splash, even though that wasn't necessarily in the prompt. It takes its own creative liberties and makes faces blink, albeit asynchronously between one eye and the other. I assume this is coming from CLIP, but who knows... this is why I'm so interested in what a model with 1 GB of weights will do, having seen so much.

I'm sure it could animate entire sequences and much more than simply "label probs" lol

Nice! Thank you for the heads-up!

OpenAI stealth released the model weights for the largest CLIP models: RN50x64 & ViT-L/14

Change the model name from ViT-B/16 to ViT-L/14 when you load the checkpoint to enjoy this beefed-up version of CLIP!

#MachineLearning #generativeart #vqganclip #generative #AI

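In code, that swap is just a different model name passed to clip.load; clip.available_models() lists whichever checkpoints your installed version of the package knows about:

```python
# The change described in the tweet: pass the new model name to clip.load.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
print(clip.available_models())                             # e.g. includes "ViT-L/14" and "RN50x64"
model, preprocess = clip.load("ViT-L/14", device=device)   # instead of "ViT-B/16"
```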

Fixed in #234