A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.
- [2023.10] Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
- [2023.09] Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
- [2023.08] Towards Generalist Foundation Model for Radiology (from SJTU)
- [2023.07] Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
- [2023.07] Towards Generalist Biomedical AI (from Google)
- [2023.07] A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from University of Oxford)
- [2023.06] Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
- [2023.06] A Survey on Multimodal Large Language Models
- [2023.04] Vision-Language Models for Vision Tasks: A Survey
- [2023.04] Foundation Models for Generalist Medical Artificial Intelligence
- [2023.03] A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- [2023.03] A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- [2022.12] Vision-language pre-training: Basics, recent advances, and future trends
- [2022.07] On the Opportunities and Risks of Foundation Models (this survey first popularized the concept of the foundation model; from Stanford)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
- Tracking Everything Everywhere All at Once (from Cornell, ICCV 2023 best student paper)
- Foundation Models for Generalist Geospatial Artificial Intelligence (from IBM and NASA)
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (from Shanghai AI Lab)
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World (from Shanghai AI Lab)
- Meta-Transformer: A Unified Framework for Multimodal Learning (from CUHK and Shanghai AI Lab)
- Retentive Network: A Successor to Transformer for Large Language Models (from Microsoft and Tsinghua University)
- Neural World Models for Computer Vision (PhD Thesis of Anthony Hu from University of Cambridge)
- Recognize Anything: A Strong Image Tagging Model (a strong foundation model for image tagging; from OPPO)
- Towards Visual Foundation Models of Physical Scenes (describes a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion; from AWS)
- LIMA: Less Is More for Alignment (65B parameters, from Meta)
- PaLM 2 Technical Report (from Google)
- IMAGEBIND: One Embedding Space To Bind Them All (from Meta)
- SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
- SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
- SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning (from BAAI, ZJU, and PKU)
- UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection (CVPR, from Tsinghua and BNRist)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models (from Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai AI Laboratory)
- Visual Prompt Multi-Modal Tracking (from Dalian University of Technology and Peng Cheng Laboratory)
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks (from ByteDance)
- EVA-CLIP: Improved Training Techniques for CLIP at Scale (from BAAI and HUST)
- EVA-02: A Visual Representation for Neon Genesis (from BAAI and HUST)
- EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale (CVPR, from BAAI and HUST)
- LLaMA: Open and Efficient Foundation Language Models (A collection of foundation language models ranging from 7B to 65B parameters; from Meta)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining (from Meta)
- BloombergGPT: A Large Language Model for Finance (50 billion parameters; from Bloomberg)
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (coordinated by BigScience, whose goal is to democratize LLMs)
- FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (from Salesforce Research)
- GPT-4 Technical Report (from OpenAI)
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (from Microsoft Research Asia)
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval (a unified model for 10 instance perception tasks; CVPR, from ByteDance)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning (from Shanghai AI Lab)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (CVPR, from Shanghai AI Lab)
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning (from Harbin Institute of Technology and Microsoft Research Asia)
- BEVT: BERT Pretraining of Video Transformers (CVPR, from Shanghai Key Lab of Intelligent Information Processing)
- Foundation Transformers (from Microsoft)
- A Generalist Agent (known as Gato, a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
- FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (from Microsoft, UCLA, and New York University)
- Flamingo: a Visual Language Model for Few-Shot Learning (from DeepMind)
- MetaLM: Language Models are General-Purpose Interfaces (from Microsoft)
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts (efficient 3D object generation using a text-to-image diffusion model; from OpenAI)
- Image Segmentation Using Text and Image Prompts (CVPR, from University of Göttingen)
- Unifying Flow, Stereo and Depth Estimation (A unified model for three motion and 3D perception tasks; from ETH Zurich)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model (from Google)
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training (NeurIPS, from Nanjing University, Tencent, and Shanghai AI Lab)
- SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
- GLIPv2: Unifying Localization and VL Understanding (NeurIPS'22, from UW, Meta, Microsoft, and UCLA)
- GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (from Microsoft)
- PaLM: Scaling Language Modeling with Pathways (from Google)
- CoCa: Contrastive Captioners are Image-Text Foundation Models (from Google)
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (from Google)
- Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (from Google)
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models (CVPR, from Stability and Runway)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs, 444 authors from 132 institutions)
- CRIS: CLIP-Driven Referring Image Segmentation (from University of Sydney and OPPO)
- Masked Autoencoders As Spatiotemporal Learners (extension of MAE to videos; NeurIPS, from Meta)
- Masked Autoencoders Are Scalable Vision Learners (CVPR 2022, from FAIR)
- InstructGPT: Training language models to follow instructions with human feedback (trained with humans in the loop; from OpenAI)
- A Unified Sequence Interface for Vision Tasks (NeurIPS 2022, from Google)
- DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents (from OpenAI)
- Robust and Efficient Medical Imaging with Self-Supervision (from Google, Georgia Tech, and Northwestern University)
- Video Swin Transformer (CVPR, from Microsoft Research Asia)
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022, from Alibaba)
- Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022, from FAIR and UIUC)
- FLAVA: A Foundational Language And Vision Alignment Model (CVPR, from Facebook AI Research)
- Towards artificial general intelligence via a multimodal foundation model (Nature Communications, from Renmin University of China)
- FILIP: Fine-Grained Interactive Language-Image Pre-Training (ICLR, from Huawei and HKUST)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (ICLR, from CMU and Google)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (from OpenAI)
- Unifying Vision-and-Language Tasks via Text Generation (from UNC-Chapel Hill)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
- UniT: Multimodal Multitask Learning with a Unified Transformer (ICCV, from FAIR)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (This paper presents the first large-scale Chinese multimodal pre-training model called BriVL; from Renmin University of China)
- Codex: Evaluating Large Language Models Trained on Code (a GPT language model finetuned on public code from GitHub, from OpenAI and Anthropic AI)
- Florence: A New Foundation Model for Computer Vision (from Microsoft)
- DALL-E: Zero-Shot Text-to-Image Generation (from OpenAI)
- CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)
- Multimodal Few-Shot Learning with Frozen Language Models (NeurIPS, from DeepMind)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV, from Microsoft Research Asia)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (the first Vision Transformer with pure self-attention blocks; ICLR, from Google)
- GPT-3: Language Models are Few-Shot Learners (175B parameters; permits in-context learning compared with GPT-2; from OpenAI)
- UNITER: UNiversal Image-TExt Representation Learning (from Microsoft)
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (from Google)
- GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP, from UNC-Chapel Hill)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (from Google AI Language)
- GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
- Attention Is All You Need (NeurIPS, from Google and University of Toronto)
- FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (proposes a generic and efficient VLP strategy based on off-the-shelf frozen vision and language models; from Salesforce Research)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
- SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
- GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
- RegionCLIP: Region-Based Language-Image Pretraining
- CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
- SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
- SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
- SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
- GPT-4 Technical Report (from OpenAI)
- GPT-3: Language Models are Few-Shot Learners (175B parameters; permits in-context learning compared with GPT-2; from OpenAI)
- GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
- GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
- Green AI (introduces the concept of Red AI vs Green AI)
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (the lottery ticket hypothesis; from MIT)
- Other Challenges and Opportunities: trust, reliability, safe use, interpretability, self-improvement, adaptation, augmentation, and understanding/predicting capability.
- Awesome-CV-Foundational-Models (maintained by Muhammad Awais)
- Awesome-Healthcare-Foundation-Models (maintained by Jianing Qiu)