/ViTamin

[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"

Primary LanguagePythonApache License 2.0Apache-2.0

🔥 Officially supported by timm and OpenCLIP. Thanks @rwightman!

One line of code to call ViTamin:

model = timm.create_model('vitamin_xlarge_384')

ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% zero-shot ImageNet accuracy.

ViTamin-L sets a new SOTA across seven benchmarks for open-vocabulary segmentation, and also push forward the capabilities of large multi-modal models (e.g., LLaVA) significantly.

🤗 The HuggingFace collection of ViTamin model cards has been released! Check out the model cards!

teaser

Get Started

It currently includes code and models for the following tasks:

ViTamin Pre-training: See ./ViTamin/README.md for a quick start, which includes CLIP pre-training / fine-tuning pipelines and zero-shot evaluation pipelines.

Open-vocabulary Detection and Segmentation: See ViTamin for Open-vocab Detection and ViTamin for Open-vocab Segmentation.

Large Multi-Modal Models: See ViTamin for Large Multi-Modal Models.

We also support ViTamin with Hugging Face model jienengchen/ViTamin-XL-384px.

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features, text_features, logit_scale = model(pixel_values, text)
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs) 

Main Results with CLIP Pre-training on DataComp-1B

We will provide 61 trained VLMs (48 benchmarked + 13 best performing) in Hugging Face for community use. Stay tuned!

image encoder 🤗 HuggingFace image size num patches text encoder depth/width seen samples (B) trainable params Image+Text (M) MACs Image+Text (G) ImageNet Acc. avg. 38 datasets ImageNet dist. shift. VTAB retrieval
ViTamin-L Link 224 196 12/768 12.8 333.3+123.7 72.6+6.6 80.8 66.7 69.8 65.3 60.3
ViTamin-L Link 256 256 12/768 12.8+0.2 333.4+123.7 94.8+6.6 81.2 67.0 71.1 65.3 61.2
ViTamin-L Link 336 441 12/768 12.8+0.2 333.6+123.7 163.4+6.6 81.6 67.0 72.1 64.4 61.6
ViTamin-L Link 384 576 12/768 12.8+0.2 333.7+123.7 213.4+6.6 81.8 67.2 72.4 64.7 61.8
ViTamin-L2 Link 224 196 24/1024 12.8 333.6+354.0 72.6+23.3 80.9 66.4 70.6 63.4 61.5
ViTamin-L2 Link 256 256 24/1024 12.8+0.5 333.6+354.0 94.8+23.3 81.5 67.4 71.9 64.1 63.1
ViTamin-L2 Link 336 441 24/1024 12.8+0.5 333.8+354.0 163.4+23.3 81.8 67.8 73.0 64.5 63.6
ViTamin-L2 Link 384 576 24/1024 12.8+0.5 334.0+354.0 213.4+23.3 82.1 68.1 73.4 64.8 63.7
ViTamin-XL Link 256 256 27/1152 12.8+0.5 436.1+488.7 125.3+33.1 82.1 67.6 72.3 65.4 62.7
ViTamin-XL Link 384 576 27/1152 12.8+0.5 436.1+488.7 281.9+33.1 82.6 68.1 73.6 65.6 63.8
ViTamin-XL Link 256 256 27/1152 40 436.1+488.7 125.3+33.1 82.3 67.5 72.8 64.0 62.1
ViTamin-XL Link 336 441 27/1152 40+1 436.1+488.7 215.9+33.1 82.7 68.0 73.9 64.1 62.6
ViTamin-XL Link 384 576 27/1152 40+1 436.1+488.7 281.9+33.1 82.9 68.1 74.1 64.0 62.5

Main Results on Downstream tasks

Open-Vocab Detection

image encoder detector OV-COCO (AP50novel) OV-LVIS (APr)
ViT-L/14 Sliding F-ViT 36.1 32.5
ViTamin-L Sliding F-ViT 37.5 35.6

Open-Vocab Segmentation

image encoder segmentor ADE Cityscapes MV A-150 A-847 PC-459 PC-59 PAS-21
ViT-L/14 Sliding FC-CLIP 24.6 40.7 16.5 31.8 14.3 18.3 55.1 81.5
ViTamin-L Sliding FC-CLIP 27.3 44.0 18.2 35.6 16.1 20.4 58.4 83.4

Note: Panoptic dataset (ADE, CityScapes, MV) are with the metric of PQ. Semantic dataset (A-150, A-847, PC-459, PC-59, PAS-21) are with the metric of mIoU.

Large Multi-modal Models

image encoder image size VQAv2 GQA VizWiz SQA T-VQA POPE MME MM-Bench MM-B-CN SEED LLaVA-Wild MM-Vet
ViTamin-L 336 78.4 61.6 51.1 66.9 58.7 84.6 1421 65.4 58.4 57.7 64.5 33.6
ViTamin-L 384 78.9 61.6 55.4 67.6 59.8 85.5 1447 64.5 58.3 57.9 66.1 33.6

Citing ViTamin

@inproceedings{chen2024vitamin,
  title={ViTamin: Designing Scalable Vision Models in the Vision-language Era},
  author={Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}