[FEATURE] Add image backbones from `MobileCLIP` paper

Question

rsomani95 opened this issue 10 months ago · 7 comments

MobileCLIP is a really fast CLIP architecture for mobile inference - about 3x faster than the fastest publicly available CLIP backbone convnext_base_w for inference on iOS / macOS devices.

They introduce 3 novel image backbones: mci{0|1|2}. It would be amazing if these models were available directly via timm. I believe this would be an essential first step towards getting it into open_clip for fine-tuning.

The arch, defined here, uses MobileOne and FastViT components, which are already available in timm. I'm not sure how compatible the re-implementation there is with the existing one in timm out of the box, but it looks like integration is definitely possible.

Answer 1 · 2024-03-18T19:09:56.000Z

@rsomani95 the components themselves are equivalent at a functional level, but the naming was remapped, so would have to remap for this model as well...
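A minimal sketch of what such a key remap could look like when porting a checkpoint. The prefixes and rename rules here are hypothetical placeholders, not the real MobileCLIP or timm parameter names (this thread doesn't list them), and `remap_state_dict` is an illustrative helper rather than a timm function.

```python
import re
from typing import Any, Dict, List, Tuple

# Hypothetical (regex, replacement) rules applied in order to every key.
# These are NOT the real MobileCLIP -> timm names, just placeholders.
RENAME_RULES: List[Tuple[str, str]] = [
    (r"^image_encoder\.model\.", ""),      # drop a wrapper prefix
    (r"^patch_embed\.", "stem."),          # rename the stem module
    (r"^network\.(\d+)\.", r"stages.\1."), # rename the stage container
]


def remap_state_dict(
    state_dict: Dict[str, Any],
    rules: List[Tuple[str, str]] = RENAME_RULES,
) -> Dict[str, Any]:
    """Return a new dict with every key rewritten by the rename rules."""
    out: Dict[str, Any] = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in rules:
            new_key = re.sub(pattern, replacement, new_key)
        if new_key in out:
            # Two source keys collapsing into one would silently drop weights.
            raise KeyError(f"duplicate key after remap: {new_key}")
        out[new_key] = value
    return out


# Toy checkpoint with strings standing in for tensors.
example = {
    "image_encoder.model.patch_embed.0.weight": "w0",
    "image_encoder.model.network.2.mlp.fc1.bias": "b1",
    "logit_scale": "s",
}
remapped = remap_state_dict(example)
```

With a real checkpoint, the remapped dict would then go to the target model's `load_state_dict`, where any missing or unexpected keys show which rules still need work.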

Answer 2 · 2024-03-21T20:17:25.000Z

@rsomani95 I took a closer look at this. s1/s2 (mci1/mci2) are the easiest; could probably map those to OpenCLIP w/ a timm FastViT encoder (after a few additions and a key remapping for weights). I think the text encoder for those is compatible.

S0 uses a RepMixer-based text encoder, so it would need new code in OpenCLIP as well. The image encoder would map to a tweaked version of FastViT.

The B model uses a ViT w/ a different stem, doable. I really like ViT NOT having BatchNorm though, so it's a shame that it's now a ViT Base w/ BN in the stem.

Answer 3 · 2024-03-21T20:51:54.000Z

@rwightman thanks for looking into that. That's really great to hear re. s1/s2 as those, in my eyes, sit in the perfect sweet spot of speed + accuracy. Given your observations, maybe it makes sense to port those two alone first? Is there something in particular I could help with?

Answer 4 · 2024-06-14T16:52:12.000Z

@rwightman Apple just released timm and OpenCLIP checkpoints: https://huggingface.co/collections/apple/mobileclip-models-datacompdr-data-665789776e1aa2b59f35f7c8

Answer 5 · 2024-06-14T17:26:28.000Z

@rsomani95 yup, I was co-ordinating with them to set it up. timm and OpenCLIP are already pointing at those checkpoints.

Answer 6 · 2024-06-14T17:38:17.000Z

Also worth pointing out: timm supports all of the models, incl. S0, since timm only covers the image tower. OpenCLIP isn't supporting S0 because it is too much extra work to support the RepMixer-based text tower for just that one model. The other models have a standard text tower.

Answer 7 · 2024-06-14T19:29:01.000Z

Awesome. Excited to use these!
Thanks for helping out with that.