- CLIP, "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021. [Code]
- VLMo, "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", NeurIPS 2022. [Code]
- BEiT-3, "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", arXiv 2022. [Code]
- CoCa, "CoCa: Contrastive Captioners are Image-Text Foundation Models", arXiv 2022. [Code]
- BLIP Family [Code]
    - BLIP, "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML 2022.
    - BLIP-2, "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models", arXiv 2023.
    - InstructBLIP, "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", arXiv 2023.
    - X-InstructBLIP, "X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning", arXiv 2023.
- LLaVA Family
    - LLaVA, "Visual Instruction Tuning", NeurIPS 2023. [Code]
    - LLaVA-1.5, "Improved Baselines with Visual Instruction Tuning", arXiv 2023. [Code]
    - Video-LLaVA, "Video-LLaVA: Learning United Visual Representation by Alignment Before Projection", arXiv 2023. [Code]
    - MoE-LLaVA, "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models", arXiv 2024. [Code]
- MiniGPT-V Family
    - MiniGPT-4, "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models", arXiv 2023. [Code]
    - MiniGPT-v2, "MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning", arXiv 2023. [Code]
    - MiniGPT-5, "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens", arXiv 2023. [Code]
- Qwen-VL, "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond", arXiv 2023. [Code]
- Emu Family [Code]
    - Emu1, "Generative Pretraining in Multimodality", arXiv 2023.
    - Emu2, "Generative Multimodal Models are In-Context Learners", arXiv 2023.
- Yi-VL. [Code]
- Ferret, "Ferret: Refer and Ground Anything Anywhere at Any Granularity", arXiv 2023. [Code]
- CogVLM, "CogVLM: Visual Expert for Pretrained Language Models", arXiv 2023. [Code]
- VTimeLLM, "VTimeLLM: Empower LLM to Grasp Video Moments", arXiv 2023. [Code]
- TimeChat, "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding", arXiv 2023. [Code]
- InternLM-XComposer, "InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition", arXiv 2023. [Code]
- VisCPM, "Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages", arXiv 2023. [Code]
- LWM, "World Model on Million-Length Video And Language With RingAttention", arXiv 2024. [Code]