EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

News

To receive updates, please join our mailing list.

[2024/03/19] An online demo of EfficientViT-SAM is available at https://evitsam.hanlab.ai/.
[2024/02/08] The EfficientViT-SAM tech report is available on arXiv.
[2024/02/07] We released EfficientViT-SAM, the first accelerated SAM model that matches or outperforms SAM-ViT-H's zero-shot performance, delivering a state-of-the-art performance-efficiency trade-off.
[2023/11/20] EfficientViT is available in the NVIDIA Jetson Generative AI Lab.
[2023/09/12] EfficientViT was highlighted on the MIT home page and by MIT News.
[2023/07/18] EfficientViT was accepted to ICCV 2023.

About EfficientViT Models

EfficientViT is a new family of ViT models for efficient high-resolution dense prediction vision tasks. The core building block of EfficientViT is a lightweight, multi-scale linear attention module that achieves a global receptive field and multi-scale learning using only hardware-efficient operations, which makes EfficientViT TensorRT-friendly and suitable for GPU deployment.
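
To make the idea of linear attention concrete, here is a minimal single-head sketch in dependency-free Python. It is an illustration under assumptions, not the code in this repository: it assumes a ReLU feature map on queries and keys, and it omits the multi-scale aggregation entirely. The function names `relu_linear_attention` and `quadratic_reference` and the `eps` parameter are hypothetical. The sketch shows that aggregating keys and values once gives the same output as building the full N x N similarity matrix, while the cost grows linearly with the number of tokens N.

```python
"""Sketch of single-head ReLU linear attention in plain Python (no dependencies)."""


def _relu_rows(m):
    # Apply ReLU element-wise so every similarity score is non-negative.
    return [[max(x, 0.0) for x in row] for row in m]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu_linear_attention(q, k, v, eps=1e-6):
    """q, k: N x d lists; v: N x d_v list. Cost grows linearly with N."""
    q, k = _relu_rows(q), _relu_rows(k)
    d, d_v = len(k[0]), len(v[0])
    # Aggregate keys and values once: kv[a][c] = sum_j k[j][a] * v[j][c].
    kv = [[sum(kj[a] * vj[c] for kj, vj in zip(k, v)) for c in range(d_v)]
          for a in range(d)]
    # Sum of keys, used for the normalization term of every query.
    k_sum = [sum(kj[a] for kj in k) for a in range(d)]
    out = []
    for qi in q:
        denom = _dot(qi, k_sum) + eps
        out.append([_dot(qi, [kv[a][c] for a in range(d)]) / denom
                    for c in range(d_v)])
    return out


def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed via the explicit N x N similarity weights."""
    q, k = _relu_rows(q), _relu_rows(k)
    out = []
    for qi in q:
        weights = [_dot(qi, kj) for kj in k]
        denom = sum(weights) + eps
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / denom
                    for c in range(len(v[0]))])
    return out


if __name__ == "__main__":
    q = [[0.5, -1.0], [1.0, 2.0], [-0.3, 0.7]]
    k = [[1.0, 0.2], [-0.5, 1.5], [0.8, 0.1]]
    v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
    fast = relu_linear_attention(q, k, v)
    slow = quadratic_reference(q, k, v)
    for row_fast, row_slow in zip(fast, slow):
        print([round(x, 4) for x in row_fast], [round(x, 4) for x in row_slow])
```

Both functions use only multiplications, additions, and a division, with no softmax, which is the kind of operation mix the paragraph above describes as hardware-efficient. A practical implementation would use batched tensor operations rather than Python lists.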

Third-Party Implementation/Integration

Getting Started

conda create -n efficientvit python=3.10
conda activate efficientvit
conda install -c conda-forge mpi4py openmpi
pip install -r requirements.txt
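
After running the commands above, a quick check like the following can confirm that the active interpreter matches the pinned Python version and that `mpi4py` is importable. This is an optional sketch, not a script shipped with the repository; the function name `check_environment` and its parameters are hypothetical.

```python
import importlib.util
import sys


def check_environment(expected_python=(3, 10), required_modules=("mpi4py",)):
    """Report whether the interpreter and modules match the setup steps."""
    version = tuple(sys.version_info[:2])
    report = {
        "python_version": version,
        "python_matches": version == tuple(expected_python),
    }
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated `efficientvit` environment; a `False` entry points to the step that needs to be repeated.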

EfficientViT Applications

Segment Anything

Model	Resolution	COCO mAP	LVIS mAP	Params	MACs	Jetson Orin Latency (bs1)	A100 Throughput (bs16)	Checkpoint
EfficientViT-SAM-L0	512x512	45.7	41.8	34.8M	35G	8.2ms	762 images/s	link
EfficientViT-SAM-L1	512x512	46.2	42.1	47.7M	49G	10.2ms	638 images/s	link
EfficientViT-SAM-L2	512x512	46.6	42.7	61.3M	69G	12.9ms	538 images/s	link
EfficientViT-SAM-XL0	1024x1024	47.5	43.9	117.0M	185G	22.5ms	278 images/s	link
EfficientViT-SAM-XL1	1024x1024	47.8	44.4	203.3M	322G	37.2ms	182 images/s	link

Table 1: Summary of all EfficientViT-SAM variants. COCO mAP and LVIS mAP are measured using ViTDet's predicted bounding boxes as the prompt. End-to-end Jetson Orin latency and A100 throughput are measured with TensorRT and FP16.

Image Classification

Semantic Segmentation

Contact

Han Cai: hancai@mit.edu

TODO

ImageNet Pretrained models
Segmentation Pretrained models
ImageNet training code
EfficientViT L series, designed for cloud
EfficientViT for segment anything
EfficientViT for image generation
EfficientViT for CLIP
EfficientViT for super-resolution
Segmentation training code

Citation

If EfficientViT is useful or relevant to your research, please kindly recognize our contributions by citing our paper:

@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}
