ViTMatte🐒

Boosting Image Matting with Pretrained Plain Vision Transformers

Jingfeng Yao¹, Xinggang Wang¹ 📧, Shusheng Yang¹, Baoyuan Wang²

¹ School of EIC, HUST, ² Xiaobing.AI

📧 Corresponding author.

News

May 24th, 2024: ViTMatte has been brought to The Foundry's Nuke. Here is a bilibili tutorial. Thanks a lot!
Oct 19th, 2023: ViTMatte has been accepted by Information Fusion (IF=18.6)!
Sep 21st, 2023: ViTMatte is now available in 🤗 HuggingFace Transformers! Many thanks to Niels!
June 12th, 2023: We released a Google Colab demo. Try ViTMatte online!
June 9th, 2023: Many thanks to Lucas for creating ViT and tweeting about our ViTMatte paper!
June 8th, 2023: Matte Anything is released! If you like ViTMatte, you may also like Matte Anything.
May 27th, 2023: We released pretrained weights of ViTMatte!
May 25th, 2023: We released the code of ViTMatte. The pretrained models are coming soon!
May 24th, 2023: We released our paper on arXiv.

Introduction

Plain Vision Transformers can also do image matting with the simple ViTMatte framework!

Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck, which helps ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) a detail capture module, which consists of simple lightweight convolutions and complements the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs on image matting with concise adaptation. It inherits many superior properties from ViTs, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting. Our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.

Get Started

Demo

You can matte the demo image with its corresponding trimap by running:

python run_one_image.py \
    --model vitmatte-s \
    --checkpoint-dir path/to/checkpoint

The demo images will be saved in ./demo. You can also run the same script on your own image and trimap.
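If you prepare your own trimap, it can help to check that it really contains only three levels before running the demo. The sketch below is not part of this repository. It assumes the common convention of an 8-bit grayscale trimap with 0 for background, 128 for unknown, and 255 for foreground. The helper names `quantize_trimap` and `unknown_fraction` and the thresholds 85 and 170 are illustrative assumptions. It works on a flattened sequence of pixel values so that it has no dependencies.

```python
from typing import Iterable, List

# Assumed trimap convention (not taken from this repository's code).
BACKGROUND, UNKNOWN, FOREGROUND = 0, 128, 255


def quantize_trimap(pixels: Iterable[int], low: int = 85, high: int = 170) -> List[int]:
    """Snap 8-bit grayscale values to the three assumed trimap levels."""
    out = []
    for value in pixels:
        if value < low:
            out.append(BACKGROUND)
        elif value >= high:
            out.append(FOREGROUND)
        else:
            out.append(UNKNOWN)
    return out


def unknown_fraction(trimap: Iterable[int]) -> float:
    """Share of pixels marked unknown, i.e. left for the model to resolve."""
    values = list(trimap)
    if not values:
        return 0.0
    return sum(v == UNKNOWN for v in values) / len(values)


# A hand-drawn trimap often has slightly off values after resizing or JPEG saving.
raw = [0, 3, 120, 130, 250, 255]
clean = quantize_trimap(raw)
print(clean, unknown_fraction(clean))
```

A very large unknown fraction usually means the trimap is too loose, and a fraction of zero means there is nothing left for the matting model to estimate.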

You can also try ViTMatte in the online demo, a simple showcase of what ViTMatte can do.

Results

Quantitative Results on Composition-1k

Model	SAD	MSE	Grad	Conn	checkpoints
ViTMatte-S	21.46	3.3	7.24	16.21	GoogleDrive
ViTMatte-B	20.33	3.0	6.74	14.78	GoogleDrive

Quantitative Results on Distinctions-646

Model	SAD	MSE	Grad	Conn	checkpoints
ViTMatte-S	21.22	2.1	8.78	17.55	GoogleDrive
ViTMatte-B	17.05	1.5	7.03	12.95	GoogleDrive
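For readers unfamiliar with the SAD and MSE columns, here is a minimal sketch of how these two metrics are commonly computed in matting benchmarks. It is not the evaluation code of this repository. It assumes that alpha values lie in [0, 1], that errors are measured only over the unknown region of the trimap (coded as 128 here), that SAD is divided by 1000, and that MSE is scaled by 10³ for reporting. The function names are illustrative, and inputs are flattened pixel sequences.

```python
from typing import Sequence

UNKNOWN = 128  # assumed trimap code for the unknown region


def sad(pred: Sequence[float], gt: Sequence[float], trimap: Sequence[int]) -> float:
    """Sum of absolute differences over unknown pixels, divided by 1000."""
    total = sum(abs(p - g) for p, g, t in zip(pred, gt, trimap) if t == UNKNOWN)
    return total / 1000.0


def mse(pred: Sequence[float], gt: Sequence[float], trimap: Sequence[int]) -> float:
    """Mean squared error over unknown pixels, scaled by 1e3 for reporting."""
    squared = [(p - g) ** 2 for p, g, t in zip(pred, gt, trimap) if t == UNKNOWN]
    if not squared:
        return 0.0
    return 1e3 * sum(squared) / len(squared)


# Toy example: only the two middle pixels are in the unknown region.
pred = [0.0, 0.5, 0.9, 1.0]
gt = [0.0, 0.7, 1.0, 1.0]
trimap = [0, 128, 128, 255]
print(sad(pred, gt, trimap), mse(pred, gt, trimap))
```

Lower is better for both metrics. Grad and Conn measure gradient and connectivity errors and are not sketched here.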

Citation

@article{yao2024vitmatte,
  title={ViTMatte: Boosting image matting with pre-trained plain vision transformers},
  author={Yao, Jingfeng and Wang, Xinggang and Yang, Shusheng and Wang, Baoyuan},
  journal={Information Fusion},
  volume={103},
  pages={102091},
  year={2024},
  publisher={Elsevier}
}

hustvl/ViTMatte