A collection of papers on visual foundation / general / universal models.
Visual models trained on large amounts of data have recently performed admirably as generalists.
Contributions and comments are welcome.
Conference / Journal | Paper | Zero-shot result |
---|---|---|
arXiv:2001.07966 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | COCO R@1: 32.3 |
arXiv:2103.00020 | CLIP: Learning Transferable Visual Models From Natural Language Supervision | COCO R@1: 37.8 |
arXiv:2102.05918 | ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | COCO R@1: 45.6 |
arXiv:2111.07783 | FILIP: Fine-grained Interactive Language-Image Pre-Training | COCO R@1: 45.9 |
arXiv:2111.11432 | Florence: A New Foundation Model for Computer Vision | COCO R@1: 47.2 |

Conference / Journal | Paper | Zero-shot result (Top-1 Acc.) |
---|---|---|
arXiv:2103.00020 | CLIP: Learning Transferable Visual Models From Natural Language Supervision | 10%: 72.6 (ResNet-50), 100%: 73.3 (ResNet-50), 80.2 (ViT-B/16) |
arXiv:2102.05918 | ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | 100%: 76.4 (EfficientNet-L2) |
arXiv:2111.07783 | FILIP: Fine-grained Interactive Language-Image Pre-Training | 100%: 78.3 (ViT-L/14) |
arXiv:2111.11432 | Florence: A New Foundation Model for Computer Vision | 100%: 83.7 (CoSwin-H@384) |

Conference / Journal | Paper | Zero-shot result (mAP) |
---|---|---|
arXiv:2111.11432 | Florence: A New Foundation Model for Computer Vision | BCCD: 15.3, Oxford Pets: 68.9 |

Conference / Journal | Paper | Zero-shot result (Top-1 Acc.) |
---|---|---|
arXiv:2103.00020 | CLIP: Learning Transferable Visual Models From Natural Language Supervision | ImageNet 100%: 73.3 (ResNet-50), 80.2 (ViT-B/16) |
arXiv:2111.08687 | INTERN: A New Learning Paradigm Towards General Vision | ImageNet 100%: 88.4 (MN-B15) |