EVA
Exploring the Limits of Masked Visual Representation Learning at Scale
Yuxin Fang2,1, Wen Wang3,1, Binhui Xie4,1, Quan Sun1, Ledell Wu1, Xinggang Wang2, Tiejun Huang1, Xinlong Wang1, Yue Cao1
We launch EVA, a vision-centric foundation model that Explores the limits of Visual representation learning at scAle using only publicly accessible data and academic resources. EVA is a vanilla ViT pre-trained to reconstruct the masked-out, image-text-aligned vision features (i.e., CLIP features) conditioned on the visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters, setting new records on a broad range of representative vision downstream tasks.
EVA is the first open-sourced 1-billion-parameter vision foundation model that achieves state-of-the-art performance on a broad range of downstream tasks.
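The pretext task described above, regressing CLIP vision features at masked patch positions, can be sketched in a few lines. The snippet below is an illustrative toy in plain Python, not the repo's actual implementation: the function names are hypothetical, and the per-patch cosine loss stands in for the paper's feature-regression objective.

```python
import math

def cosine_loss(pred, target):
    # 1 - cosine similarity between a predicted and a target feature vector;
    # 0 when the prediction points in exactly the target direction.
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

def masked_feature_loss(pred_features, clip_features, mask):
    # Average the loss over MASKED patch positions only (mask[i] == 1);
    # visible patches serve as conditioning and contribute no loss.
    losses = [cosine_loss(p, t)
              for p, t, m in zip(pred_features, clip_features, mask) if m]
    return sum(losses) / len(losses)
```

In the real setup, `pred_features` would come from the ViT's outputs at mask-token positions and `clip_features` from a frozen CLIP vision tower on the full image; only the masked positions drive the gradient.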
News
- **Nov 20, 2022**: release code & models for pre-training and image classification.
- **Nov 18, 2022**: release wandb log & statistics of 1.3B EVA-CLIP training.
Summary of EVA's performance
image & video classification
| model | #param. | IN-1K | IN-1K (zero-shot) | 12 avg. (zero-shot) | K400 | K600 | K700 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EVA | 1.0B | 89.7 | 78.2 | 72.5 | 89.7 | 89.8 | 82.9 |
object detection & segmentation
| model | #param. | COCO det (test) | COCO det (val) | COCO ins. seg. (test) | COCO ins. seg. (val) | LVIS det | LVIS ins. seg. | COCO-Stuff | ADE20K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVA | 1.0B | 64.7 | 64.5 | 55.5 | 55.0 | 62.2 | 55.0 | 53.4 | 62.3 |
Citation
If you find our work helpful, please star this repo and cite the related articles. Thanks for your support!
```bibtex
@article{EVA,
  title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
  author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2211.07636},
  year={2022}
}
```
We are Hiring
We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns.
If you are interested in working with us on foundation models, self-supervised learning, and multimodal learning, please contact Yue Cao (caoyue@baai.ac.cn) and Xinlong Wang (wangxinlong@baai.ac.cn).