[FEATURE] Add image backbones from `MobileCLIP` paper
rsomani95 opened this issue · 7 comments
MobileCLIP is a really fast CLIP architecture for mobile inference - about 3x faster than the fastest publicly available CLIP backbone `convnext_base_w` for inference on iOS / macOS devices.
They introduce 3 novel image backbones: `mci{0|1|2}`. It would be amazing if these models were available directly via `timm`. I believe this would be an essential first step towards getting it into `open_clip` for fine-tuning.
The arch, defined here, uses MobileOne and FastViT components, which are already available in `timm`. I'm not sure how compatible the re-implementation there is with the existing one in `timm` out of the box, but it smells like integration is definitely possible.
@rsomani95 the components themselves are equivalent at a functional level, but the naming was remapped, so would have to remap for this model as well...
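For illustration, the kind of key remapping described above could be sketched roughly as follows. This is a minimal sketch on a plain dict: the prefixes and rename rules (`image_encoder.model.`, `patch_embed.` to `stem.`, `network.N.` to `stages.N.`) are hypothetical placeholders, not the actual key names in the MobileCLIP checkpoints or in timm.

```python
import re

# Hypothetical (pattern, replacement) rules for illustration only; the real
# key names in the MobileCLIP checkpoints and in timm's FastViT may differ.
REMAP_RULES = [
    (r"^image_encoder\.model\.", ""),       # drop a wrapper prefix
    (r"^patch_embed\.", "stem."),           # rename the stem module
    (r"^network\.(\d+)\.", r"stages.\1."),  # rename the stage container
]


def remap_state_dict(state_dict, rules=REMAP_RULES):
    """Return a new dict whose keys are rewritten by applying each rule in order."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in rules:
            new_key = re.sub(pattern, replacement, new_key)
        # Guard against two source keys collapsing onto the same target key.
        if new_key in remapped:
            raise KeyError(f"two source keys map to {new_key!r}")
        remapped[new_key] = value
    return remapped


# Toy checkpoint with strings standing in for tensors.
original = {
    "image_encoder.model.patch_embed.0.weight": "w0",
    "image_encoder.model.network.2.mlp.fc1.bias": "b1",
    "head.weight": "w2",
}
remapped = remap_state_dict(original)
```

With real weights, the remapped dict would then be passed to the target model's `load_state_dict`, and any missing or unexpected keys reported there would show which rules still need adjusting.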
@rsomani95 I took a closer look at this. s1/s2 (mci1/mci2) are the easiest; they could probably be mapped to OpenCLIP w/ a timm FastViT encoder (after a few additions and a key remapping for weights). I think the text encoder for those is compatible.
S0 uses a repmixer based text encoder so would need new code in OpenCLIP as well. The image encoder would map to a tweaked ver of FastViT.
The B model uses a ViT w/ a different stem, which is doable. I really like ViT NOT having BatchNorm though, so it's a shame that it's now a ViT Base w/ BN in the stem.
@rwightman thanks for looking into that. That's really great to hear re. s1/s2, as those, in my eyes, sit in the perfect sweet spot of speed + accuracy. Given your observations, maybe it makes sense to port those two alone first? Is there something in particular I could help with?
@rwightman Apple just released timm and OpenCLIP checkpoints: https://huggingface.co/collections/apple/mobileclip-models-datacompdr-data-665789776e1aa2b59f35f7c8
@rsomani95 yup, I was co-ordinating with them to set it up. timm and OpenCLIP are already pointing at those checkpoints.
Also worth pointing out: timm supports all of the models, incl. S0, since timm only needs the image tower. OpenCLIP doesn't support S0 because it is too much extra work to support the RepMixer-based text tower for just that one model. The other models have a standard text tower.
Awesome. Excited to use these!
Thanks for helping out with that.