asuzukosi/Awesome-Foundation-Models

A curated list of foundation models for vision and language tasks

Awesome-Foundation-Models

A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.

Survey

2024

Image Segmentation in Foundation Model Era: A Survey (from Beijing Institute of Technology)
Towards Vision-Language Geo-Foundation Model: A Survey (from Nanyang Technological University)
An Introduction to Vision-Language Modeling (from Meta)
The Evolution of Multimodal Model Architectures (from Purdue University)
Efficient Multimodal Large Language Models: A Survey (from Tencent)
Foundation Models for Video Understanding: A Survey (from Aalborg University)
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (from GigaAI)
Prospective Role of Foundation Models in Advancing Autonomous Vehicles (from Tongji University)
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (from Northeastern University)
A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (from Lehigh)
Large Multimodal Agents: A Survey (from CUHK)
The Uncanny Valley: A Comprehensive Analysis of Diffusion Models (from Mila)
Real-World Robot Applications of Foundation Models: A Review (from University of Tokyo)
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities (from Shanghai AI Lab)
Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey (from JHU)

Before 2024

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
Towards Generalist Foundation Model for Radiology (from SJTU)
Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
Towards Generalist Biomedical AI (from Google)
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from Oxford)
Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
A Survey on Multimodal Large Language Models (from USTC and Tencent)
Vision-Language Models for Vision Tasks: A Survey (from Nanyang Technological University)
Foundation Models for Generalist Medical Artificial Intelligence (from Stanford)
A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
Vision-language pre-training: Basics, recent advances, and future trends
On the Opportunities and Risks of Foundation Models (this survey first popularized the concept of foundation models; from Stanford)

Papers by Date

2024

[10/10] Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations (from CUHK)
[10/04] Movie Gen: A Cast of Media Foundation Models (from Meta)
[10/02] Were RNNs All We Needed? (from Mila)
[09/30] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (from Apple)
[09/27] Emu3: Next-Token Prediction is All You Need (from BAAI)
[09/25] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (from Allen AI)
[09/18] Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution (from Alibaba)
[09/18] Moshi: a speech-text foundation model for real-time dialogue (from Kyutai)
[08/27] Diffusion Models Are Real-Time Game Engines (from Google)
[08/22] Sapiens: Foundation for Human Vision Models (from Meta)
[08/14] Imagen 3 (from Google DeepMind)
[07/31] The Llama 3 Herd of Models (from Meta)
[07/29] SAM 2: Segment Anything in Images and Videos (from Meta)
[07/24] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects (from HUST and ByteDance)
[07/17] EVE: Unveiling Encoder-Free Vision-Language Models (from BAAI)
[07/12] Transformer Layers as Painters (from Sakana AI)
[06/24] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (from NYU)
[06/13] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (from EPFL and Apple)
[06/10] Merlin: A Vision Language Foundation Model for 3D Computed Tomography (from Stanford. Code will be available.)
[06/06] Vision-LSTM: xLSTM as Generic Vision Backbone (from LSTM authors)
[05/31] MeshXL: Neural Coordinate Field for Generative 3D Foundation Models (from Fudan)
[05/22] Attention as an RNN (from Mila & Borealis AI)
[05/22] GigaPath: A whole-slide foundation model for digital pathology from real-world data (from Nature)
[05/21] BiomedParse: a biomedical foundation model for biomedical image parsing (from Microsoft)
[05/20] Octo: An Open-Source Generalist Robot Policy (from UC Berkeley)
[05/17] Observational Scaling Laws and the Predictability of Language Model Performance (from Stanford)
[05/14] Understanding the performance gap between online and offline alignment algorithms (from Google)
[05/09] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (from Shanghai AI Lab)
[05/08] You Only Cache Once: Decoder-Decoder Architectures for Language Models
[05/07] xLSTM: Extended Long Short-Term Memory (from Sepp Hochreiter, the author of LSTM.)
[05/06] Advancing Multimodal Medical Capabilities of Gemini (from Google)
[05/03] Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
[04/30] KAN: Kolmogorov-Arnold Networks (Promising alternatives to MLPs. from MIT)
[04/26] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (InternVL 1.5. from Shanghai AI Lab)
[04/14] TransformerFAM: Feedback attention is working memory (from Google. Efficient attention.)
[04/10] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (from Google)
[04/02] Octopus v2: On-device language model for super agent (from Stanford)
[04/02] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (from Google)
[03/22] InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (from Shanghai AI Lab)
[03/18] Arc2Face: A Foundation Model of Human Faces (from Imperial College London)
[03/14] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (30B parameters. from Apple)
[03/09] uniGradICON: A Foundation Model for Medical Image Registration (from UNC-Chapel Hill)
[03/05] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3. from Stability AI)
[03/01] Learning and Leveraging World Models in Visual Representation Learning (from Meta)
[03/01] VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (from Meituan)
[02/28] CLLMs: Consistency Large Language Models (from SJTU)
[02/27] Transparent Image Layer Diffusion using Latent Transparency (from Stanford)
[02/22] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (from Meta)
[02/21] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (from Meta)
[02/20] Neural Network Diffusion (Generating network parameters via diffusion models. from NUS)
[02/20] VideoPrism: A Foundational Visual Encoder for Video Understanding (from Google)
[02/19] FiT: Flexible Vision Transformer for Diffusion Model (from Shanghai AI Lab)
[02/06] MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (from Meituan)
[01/30] YOLO-World: Real-Time Open-Vocabulary Object Detection (from Tencent and HUST)
[01/23] Lumiere: A Space-Time Diffusion Model for Video Generation (from Google)
[01/22] CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation (from Stanford)
[01/19] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (from TikTok)
[01/16] SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (from NYU)
[01/15] InstantID: Zero-shot Identity-Preserving Generation in Seconds (from Xiaohongshu)

2023

BioCLIP: A Vision Foundation Model for the Tree of Life (CVPR 2024 best student paper)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba appears to outperform similarly-sized Transformers while scaling linearly with sequence length. from CMU)
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
Tracking Everything Everywhere All at Once (from Cornell, ICCV 2023 best student paper)
Foundation Models for Generalist Geospatial Artificial Intelligence (from IBM and NASA)
LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (from Shanghai AI Lab)
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World (from Shanghai AI Lab)
Meta-Transformer: A Unified Framework for Multimodal Learning (from CUHK and Shanghai AI Lab)
Retentive Network: A Successor to Transformer for Large Language Models (from Microsoft and Tsinghua University)
Neural World Models for Computer Vision (PhD Thesis of Anthony Hu from University of Cambridge)
Recognize Anything: A Strong Image Tagging Model (a strong foundation model for image tagging. from OPPO)
Towards Visual Foundation Models of Physical Scenes (describes a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion; from AWS)
LIMA: Less Is More for Alignment (65B parameters, from Meta)
PaLM 2 Technical Report (from Google)
ImageBind: One Embedding Space To Bind Them All (from Meta)
Visual Instruction Tuning (LLaVA, from University of Wisconsin-Madison and Microsoft)
SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
Images Speak in Images: A Generalist Painter for In-Context Visual Learning (from BAAI, ZJU, and PKU)
UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection (CVPR, from Tsinghua and BNRist)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models (from Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai AI Laboratory)
Visual Prompt Multi-Modal Tracking (from Dalian University of Technology and Peng Cheng Laboratory)
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks (from ByteDance)
EVA-CLIP: Improved Training Techniques for CLIP at Scale (from BAAI and HUST)
EVA-02: A Visual Representation for Neon Genesis (from BAAI and HUST)
EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale (CVPR, from BAAI and HUST)
LLaMA: Open and Efficient Foundation Language Models (A collection of foundation language models ranging from 7B to 65B parameters; from Meta)
The effectiveness of MAE pre-pretraining for billion-scale pretraining (from Meta)
BloombergGPT: A Large Language Model for Finance (50 billion parameters; from Bloomberg)
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (this work was coordinated by BigScience whose goal is to democratize LLMs.)
FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (from Salesforce Research)
GPT-4 Technical Report (from OpenAI)
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (from Microsoft Research Asia)
UNINEXT: Universal Instance Perception as Object Discovery and Retrieval (a unified model for 10 instance perception tasks; CVPR, from ByteDance)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (from Shanghai AI Lab)
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (CVPR, from Shanghai AI Lab)
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning (from Harbin Institute of Technology and Microsoft Research Asia)

2022

BEVT: BERT Pretraining of Video Transformers (CVPR, from Shanghai Key Lab of Intelligent Information Processing)
Foundation Transformers (from Microsoft)
A Generalist Agent (known as Gato, a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (from Microsoft, UCLA, and New York University)
Flamingo: a Visual Language Model for Few-Shot Learning (from DeepMind)
MetaLM: Language Models are General-Purpose Interfaces (from Microsoft)
Point-E: A System for Generating 3D Point Clouds from Complex Prompts (efficient 3D object generation using a text-to-image diffusion model; from OpenAI)
Image Segmentation Using Text and Image Prompts (CVPR, from University of Göttingen)
Unifying Flow, Stereo and Depth Estimation (A unified model for three motion and 3D perception tasks; from ETH Zurich)
PaLI: A Jointly-Scaled Multilingual Language-Image Model (from Google)
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training (NeurIPS, from Nanjing University, Tencent, and Shanghai AI Lab)
SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
GLIPv2: Unifying Localization and VL Understanding (NeurIPS'22, from UW, Meta, Microsoft, and UCLA)
GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (from Microsoft)
PaLM: Scaling Language Modeling with Pathways (from Google)
CoCa: Contrastive Captioners are Image-Text Foundation Models (from Google)
Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (from Google)
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (from Google)
Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models (CVPR, from Stability and Runway)
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs, 444 authors from 132 institutions)
CRIS: CLIP-Driven Referring Image Segmentation (from University of Sydney and OPPO)
Masked Autoencoders As Spatiotemporal Learners (extension of MAE to videos; NeurIPS, from Meta)
Masked Autoencoders Are Scalable Vision Learners (CVPR 2022, from FAIR)
InstructGPT: Training language models to follow instructions with human feedback (trained with humans in the loop; from OpenAI)
A Unified Sequence Interface for Vision Tasks (NeurIPS 2022, from Google)
DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents (from OpenAI)
Robust and Efficient Medical Imaging with Self-Supervision (from Google, Georgia Tech, and Northwestern University)
Video Swin Transformer (CVPR, from Microsoft Research Asia)
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022. from Alibaba.)
Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022, from FAIR and UIUC)
FLAVA: A Foundational Language And Vision Alignment Model (CVPR, from Facebook AI Research)
Towards artificial general intelligence via a multimodal foundation model (Nature Communications, from Renmin University of China)
FILIP: Fine-Grained Interactive Language-Image Pre-Training (ICLR, from Huawei and HKUST)
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (ICLR, from CMU and Google)
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (from OpenAI)

2021

Unifying Vision-and-Language Tasks via Text Generation (from UNC-Chapel Hill)
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
UniT: Multimodal Multitask Learning with a Unified Transformer (ICCV, from FAIR)
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (This paper presents the first large-scale Chinese multimodal pre-training model called BriVL; from Renmin University of China)
Codex: Evaluating Large Language Models Trained on Code (a GPT language model finetuned on public code from GitHub, from OpenAI and Anthropic AI)
Florence: A New Foundation Model for Computer Vision (from Microsoft)
DALL-E: Zero-Shot Text-to-Image Generation (from OpenAI)
CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)
Multimodal Few-Shot Learning with Frozen Language Models (NeurIPS, from DeepMind)
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV, from Microsoft Research Asia)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (The first Vision Transformer with pure self-attention blocks; ICLR, from Google)

Before 2021

GPT-3: Language Models are Few-Shot Learners (175B parameters; permits in-context learning compared with GPT-2; from OpenAI)
UNITER: UNiversal Image-TExt Representation Learning (from Microsoft)
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (from Google)
GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP, from UNC-Chapel Hill)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (from Google AI Language)
GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
Attention Is All You Need (NeurIPS, from Google and UoT)

Papers by Topic

Large Language/Multimodal Models

LLaVA: Visual Instruction Tuning (from University of Wisconsin-Madison)
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (from KAUST)
GPT-4 Technical Report (from OpenAI)
GPT-3: Language Models are Few-Shot Learners (175B parameters; permits in-context learning compared with GPT-2; from OpenAI)
GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
LLaMA: Open and Efficient Foundation Language Models (models ranging from 7B to 65B parameters; from Meta)
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (from Google)

Linear Attention

Large Benchmarks

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (from Shanghai AI Lab, 2024)
BLINK: Multimodal Large Language Models Can See but Not Perceive (multimodal benchmark. from University of Pennsylvania, 2024)
CAD-Estate: Large-scale CAD Model Annotation in RGB Videos (RGB videos with CAD annotation. from Google, 2023)
ImageNet: A Large-Scale Hierarchical Image Database (vision benchmark. from Stanford, 2009)

Vision-Language Pretraining

FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (proposes a generic and efficient VLP strategy based on off-the-shelf frozen vision and language models. from Salesforce Research)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
RegionCLIP: Region-Based Language-Image Pretraining
CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)

Perception Tasks: Detection, Segmentation, and Pose Estimation

SAM 2: Segment Anything in Images and Videos (from Meta)
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)

Training Efficiency

Green AI (introduces the concept of Red AI vs Green AI)
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (the lottery ticket hypothesis, from MIT)

Towards Artificial General Intelligence (AGI)

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models (from Huawei)

AI Safety and Responsibility

Bounding the probability of harm from an AI to create a guardrail (blog from Yoshua Bengio)
Managing Extreme AI Risks amid Rapid Progress (from Science, May 2024)

Related Awesome Repositories