CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation

[Paper][🤗Ckpts]

Abstract

'''State space models and Mamba-based models have been increasingly applied across various domains, achieving state-of-the-art performance. This technical report introduces the first attempt to train a transferable Mamba model utilizing contrastive language-image pretraining (CLIP). We have trained Mamba models of varying sizes and undertaken comprehensive evaluations of these models on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings reveal that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) model in zero-shot classification tasks, highlighting the parameter efficiency of Mamba models. In tests of OOD generalization, Mamba-based models exhibit exceptional performance in conditions of OOD image contrast or when subjected to high-pass filtering. However, a Hessian analysis indicates that Mamba models feature a sharper and more non-convex landscape compared to ViT-based models, making them more challenging to train.'''
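For readers unfamiliar with the objective named in the abstract, the following is a minimal NumPy sketch of the generic symmetric contrastive (InfoNCE) loss used in CLIP-style pretraining. It is not this repository's training code: the function name `clip_contrastive_loss` and the fixed `temperature=0.07` are assumptions for illustration, and the vision encoder (Mamba or ViT) is abstracted away as precomputed embeddings.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N paired (image, text) embeddings.

    Hypothetical helper for illustration; row i of image_emb is assumed to be
    paired with row i of text_emb.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    loss_i2t = cross_entropy_on_diagonal(logits)
    loss_t2i = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy check: correctly paired embeddings give a lower loss than mismatched ones.
emb = np.eye(4)
loss_aligned = clip_contrastive_loss(emb, emb)
loss_shuffled = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In the toy check, `loss_aligned` is close to zero because every image is most similar to its own text, while `loss_shuffled` is large because every pair is mismatched.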

Main results

Zero-shot performance of different architectures trained with CLIP

| Methods | Food-101 | CIFAR-10 | CIFAR-100 | CUB | SUN397 | Cars | Aircraft | DTD | Pets | Caltech-101 | Flowers | MNIST | FER-2013 | STL-10 | EuroSAT | RESISC45 | GTSRB | KITTI | Country211 | PCAM | UCF101 | Kinetics700 | CLEVR | HatefulMemes | SST2 | ImageNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VMamba_B (89M) | 48.5 | 58.0 | 29.9 | 36.5 | 50.4 | 5.8 | 8.5 | 26.5 | 30.2 | 64.7 | 52.8 | 9.7 | 19.6 | 91.9 | 16.0 | 30.4 | 7.9 | 40.2 | 10.2 | 59.9 | 35.2 | 25.6 | 12.6 | 51.6 | 50.1 | 38.3 |
| VMamba_S (50M) | 49.4 | 70.3 | 34.3 | 39.1 | 53.9 | 6.9 | 8.4 | 26.0 | 31.3 | 68.7 | 54.1 | 10.1 | 9.8 | 92.8 | 17.6 | 31.4 | 6.9 | 23.5 | 10.9 | 54.2 | 38.4 | 27.1 | 13.2 | 50.5 | 50.0 | 40.0 |
| VMamba_T220 (30M) | 46.5 | 50.9 | 22.9 | 35.6 | 51.1 | 5.7 | 6.8 | 25.1 | 31.0 | 64.9 | 54.0 | 10.1 | 12.5 | 91.6 | 13.9 | 25.4 | 10.7 | 32.3 | 9.9 | 55.0 | 34.0 | 25.1 | 12.7 | 53.9 | 50.6 | 38.7 |
| Simba_L (66.6M) | 52.7 | 67.4 | 31.0 | 39.1 | 52.7 | 6.9 | 9.1 | 27.8 | 33.4 | 68.9 | 55.9 | 8.0 | 16.0 | 93.9 | 17.4 | 32.3 | 8.9 | 41.5 | 11.1 | 58.1 | 35.7 | 27.9 | 12.1 | 54.9 | 50.1 | 41.6 |
| VIT_B (84M) | 50.6 | 66.0 | 34.5 | 38.8 | 51.1 | 4.0 | 5.4 | 21.2 | 28.5 | 60.9 | 53.3 | 8.4 | 17.3 | 90.5 | 30.2 | 21.5 | 6.1 | 35.1 | 10.5 | 53.5 | 28.5 | 22.1 | 10.8 | 52.4 | 50.7 | 37.6 |
| VIT-L (307M) | 59.5 | 72.9 | 41.5 | 40.3 | 53.6 | 6.9 | 6.4 | 20.6 | 27.9 | 65.4 | 55.0 | 10.3 | 34.5 | 94.2 | 22.7 | 28.8 | 5.8 | 41.4 | 12.5 | 54.9 | 34.3 | 24.0 | 12.9 | 54.3 | 50.1 | 40.4 |
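As a rough illustration of how zero-shot numbers like these are typically produced with a CLIP-style model, the sketch below classifies each image by picking the class whose text-prompt embedding has the highest cosine similarity with the image embedding. It assumes the scores are top-1 accuracy in percent, which the table does not state explicitly. The helpers `zero_shot_predict` and `top1_accuracy` are hypothetical names, and the embeddings are toy stand-ins rather than outputs of the released checkpoints.

```python
import numpy as np


def zero_shot_predict(image_emb, class_text_emb):
    """Return the index of the most similar class prompt for each image.

    image_emb:      (num_images, dim) image embeddings.
    class_text_emb: (num_classes, dim) embeddings of one text prompt per class.
    """
    # Normalize both sides so the matrix product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    class_text_emb = class_text_emb / np.linalg.norm(
        class_text_emb, axis=1, keepdims=True
    )
    similarity = image_emb @ class_text_emb.T  # (num_images, num_classes)
    return similarity.argmax(axis=1)


def top1_accuracy(pred, labels):
    """Top-1 accuracy in percent."""
    return 100.0 * float(np.mean(pred == labels))


# Toy stand-ins: three classes with orthogonal prompt embeddings, and four
# images whose embeddings lie close to their true class prompt.
class_text_emb = np.eye(3)
labels = np.array([0, 2, 1, 2])
image_emb = class_text_emb[labels] + 0.1

pred = zero_shot_predict(image_emb, class_text_emb)
acc = top1_accuracy(pred, labels)
```

In a real evaluation, `image_emb` would come from the vision encoder and `class_text_emb` from the text encoder applied to prompts built from each dataset's class names.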

Acknowledgment

This project is based on A-CLIP (paper, code), VMamba (paper, code), and SiMBA (paper, code). Thanks to the authors for their excellent work.