
🎆 SparseFormer

This is the official repo for the following SparseFormer works:

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens (ICLR 2024)
Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou

Bootstrapping SparseFormers from Vision Foundation Models (CVPR 2024)
Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

Out-of-the-box SparseFormer as a Library (recommended)

We provide out-of-the-box SparseFormer usage via the sparseformer library.

Getting started. You can install sparseformer as a library with the following command:

pip install -e sparseformer # in this folder

Available pre-trained model weights are listed here, including v1 weights and bootstrapped ones. Simply call create_model with download=True to get a pre-trained model. You can play with it like this:

from sparseformer.factory import create_model

# e.g., make a SparseFormer v1 tiny model
model = create_model("sparseformer_v1_tiny", download=True)


# or make a CLIP SparseFormer large model and put it into an OpenCLIP pipeline
import open_clip
# create_model_and_transforms returns (model, preprocess_train, preprocess_val)
clip, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
visual = create_model("sparseformer_btsp_openai_clip_large", download=True)
clip.visual = visual  # swap in the SparseFormer visual encoder
# ...
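
Once a model is created, a quick smoke test might look like the following. This is only a sketch under assumptions: the standard 1×3×224×224 NCHW input and the logits-shaped output are assumptions on our part and may differ per model.

import torch
from sparseformer.factory import create_model

model = create_model("sparseformer_v1_tiny", download=True)
model.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image (assumed input format)
    out = model(dummy)
print(out.shape)  # expected to be classification logits for the v1 ImageNet model (assumption)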

Video SparseFormers. We also provide a unified MediaSparseFormer implementation for both video and image inputs (an image is treated as a single-frame video), with token inflation controlled by the replicates argument. MediaSparseFormer can load pre-trained weights from an image SparseFormer via load_2d_state_dict; see the sketch after the note below.

Note: pre-trained weights for VideoSparseFormers are currently unavailable. We may reproduce VideoSparseFormers if there is strong demand from the community.
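
The sketch below shows how these pieces could fit together. It is only illustrative: the import path of MediaSparseFormer, its constructor signature, and whether base_btsp_config applies to it are assumptions; only the replicates argument and load_2d_state_dict come from the description above, and the model name passed to create_model is hypothetical.

from sparseformer.factory import create_model
from sparseformer.modeling import MediaSparseFormer  # assumed import path
from sparseformer.config import base_btsp_config     # assumed to also describe MediaSparseFormer

config = base_btsp_config()
# 'replicates' is the token inflation argument; the value and keyword placement are illustrative.
video_model = MediaSparseFormer(**config, replicates=4)

# Initialize from an image SparseFormer whose architecture matches `config`
# (model name is hypothetical; pick one from the pre-trained weights list).
image_model = create_model("sparseformer_btsp_openai_clip_base", download=True)
video_model.load_2d_state_dict(image_model.state_dict())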

ADVANCED: Make your own SparseFormer and load timm weights. Our codebase is generally compatible with timm vision transformer weights, so here is something fun to try: you can build your own SparseFormer and load timm transformer weights, not limited to our provided configurations!

For example, you can build a SparseFormer similar to a ViT-B/16 at 224 resolution, with sampling & decoding and RoI adjusting every 3 blocks, and load OpenAI's official CLIP pre-trained weights into it:

from sparseformer.modeling import SparseFormer, OP
from sparseformer.config import base_btsp_config

# Per-layer op schedule: every 3rd block also performs sampling, RoI adjusting,
# and positional-embedding injection; the rest are plain attention + MLP blocks.
ops_list = []
num_layers = 12
for i in range(num_layers):
    if i % 3 == 0:
        ops_list.append([OP.SAMPLING_M, OP.ATTN, OP.MLP, OP.ROI_ADJ, OP.PE_INJECT])
    else:
        ops_list.append([OP.ATTN, OP.MLP])

config = base_btsp_config()
config.update(
    num_latent_tokens=16,
    num_sampling_points=9,
    width_configs=[768, ]*num_layers,
    repeats=[1, ]*num_layers,
    ops_list=ops_list,
)

model = SparseFormer(**config)

import timm

# Load OpenAI CLIP ViT-B/16 weights via timm and remap timm's "blocks.*" keys
# to SparseFormer's "layers.*" naming.
pretrained = timm.create_model("vit_base_patch16_clip_224.openai", pretrained=True)
new_dict = dict()
old_dict = pretrained.state_dict()
for k in old_dict:
    nk = k
    if "blocks" in k:
        nk = nk.replace("blocks", "layers")
    new_dict[nk] = old_dict[k]
# strict=False: the sampling/decoding and RoI-adjusting parameters stay newly initialized.
print(model.load_state_dict(new_dict, strict=False))

All attention and MLP weights should load successfully. The resulting SparseFormer needs to be fine-tuned to produce meaningful outputs, since the sampling & decoding and RoI-adjusting parts are newly initialized. Maybe you can fine-tune it into a CLIP-based open-vocabulary detector (we have not tried this yet, but it seems very promising imo! :D).

Training (SparseFormer v1)

For training SparseFormer v1 on ImageNet (SparseFormer: Sparse Visual Recognition via Limited Latent Tokens), please check the imagenet subfolder.

Note: this imagenet sub-codebase will be refactored soon.

Citation

If you find SparseFormer useful in your research or work, please consider citing us with the following entries:

@inproceedings{gao2024sparseformer,
  author       = {Ziteng Gao and
                  Zhan Tong and
                  Limin Wang and
                  Mike Zheng Shou},
  title        = {SparseFormer: Sparse Visual Recognition via Limited Latent Tokens},
  booktitle    = {{ICLR}},
  publisher    = {OpenReview.net},
  year         = {2024}
}

@inproceedings{gao2024bootstrapping,
  author       = {Ziteng Gao and
                  Zhan Tong and
                  Kevin Qinghong Lin and
                  Joya Chen and
                  Mike Zheng Shou},
  title        = {Bootstrapping SparseFormers from Vision Foundation Models},
  booktitle    = {{CVPR}},
  pages        = {17710--17721},
  publisher    = {{IEEE}},
  year         = {2024}
}