/Vision-Language-Models-Overview

A cutting-edge collection and survey of vision-language model papers and model GitHub repositories. Continuously updated.

Benchmark and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models

Below we compile papers, models, and GitHub repositories covering:

  • State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we'll keep adding new models and benchmarks).
  • Evaluation: VLM benchmarks with links to the corresponding works.
  • Post-training/Alignment: the newest related work on VLM alignment, including RL and SFT.
  • Applications: applications of VLMs in embodied AI, robotics, etc.
  • Contributions: surveys, perspectives, and datasets on the above topics.

Contributions and discussion are welcome!


🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.


1. 📚 SoTA VLMs

| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL | 2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | Eva CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
| VisualBERT | 2019 | Encoder-only | COCO | 110M | Faster R-CNN | Pretrained from scratch |

2. 🗂️ Benchmarks and Evaluation

2.1. Datasets and Evaluation for VLM

| Benchmark Dataset | Domain | Metric Type | Source | Size (K) | Project |
| --- | --- | --- | --- | --- | --- |
| Inst-IT-Bench | Fine-grained Image and Video Understanding | Multiple Choice & LLM Eval | Human/Synthetic | 2 | Github Repo |
| MovieChat | Video understanding | LLM Eval | Human | 1 | Github Repo |
| PHYSBENCH | Visual math reasoning | Multiple Choice | Graduate STEM Students | 100 | Github Repo |
| MMTBench | Visual reasoning, understanding, recognition, and question answering | Multiple Choice | AI Experts | 30.1 | Github Repo |
| MM-Vet | Optical Character Recognition (OCR) / Visual reasoning | LLM Eval | Human | 0.2 | Github Repo |
| MM-En/CN | Multilingual multimodal understanding | Multiple Choice | Human | 3.2 | Github Repo |
| GQA | Visual reasoning, understanding, recognition, and question answering | Answer Matching | Seed with Synthetic | 22,000 | Website |
| VCR | Visual reasoning, understanding, recognition, and question answering | Multiple Choice | MTurks | 290 | Website |
| VQAv2 | Visual reasoning, understanding, recognition, and question answering | Yes/No, Answer Matching | MTurks | 1,100 | Github Repo |
| MMMU | Visual reasoning, understanding, recognition, and question answering | Answer Matching, Multiple Choice | College Students | 11.5 | Website |
| TextVQA | Visual text understanding | Answer Matching | Expert Human | 28.6 | Github Repo |
| DocVQA | Visual text understanding | Answer Matching | CrowdSource | 50 | Website |
| MSCOCO-30K | Text-to-Image generation | BLEU, Rouge, Similarity | MTurks | 30 | Website |
| ChartQA | Chart graphic understanding | Answer Matching | CrowdSource/Synthetic | 32.7 | Github Repo |
| Perception-Test | Video understanding | Multiple Choice | CrowdSource | 11.6 | Github Repo |
| MMLU | Multimodal general intelligence | Multiple Choice | Human | 15.9 | Github Repo |
| MMStar | Multimodal general intelligence | Multiple Choice | Human | 1.5 | Website |
| VideoMME | Video understanding | Multiple Choice | Experts | 2.7 | Website |
| EgoSchema | Video understanding | Multiple Choice | Synthetic/Human | 5 | Website |
| HallusionBench | Hallucination | Yes/No | Human | 1.13 | Github Repo |
| POPE | Hallucination | Yes/No | Human | 9 | Github Repo |
| CHAIR | Hallucination | Yes/No | Human | 124 | Github Repo |
| MHalDetect | Hallucination | Answer Matching | Human | 4 | Github Repo |
| Hallu-Pi | Hallucination | Answer Matching | Human | 1.260 | Github Repo |
| HallE-Control | Hallucination | Yes/No | Human | 108 | Github Repo |
| AutoHallusion | Hallucination | Answer Matching | Synthetic | 3.129 | Github Repo |
| BEAF | Hallucination | Yes/No | Human | 26 | Github Repo |
| GAIVE | Hallucination | Answer Matching | Synthetic | 320 | Github Repo |
| HalEval | Hallucination | Yes/No | CrowdSource/Synthetic | 2,000 | Github Repo |
| AMBER | Hallucination | Answer Matching | Human | 15.22 | Github Repo |
| GenAI-Bench | Text-to-Image generation | Human Ratings | Human | 80.0 | Huggingface |
| NaturalBench | Multimodal general intelligence | Yes/No, Multiple Choice | Human | 10.0 | Huggingface |
| R1-Onevision | Visual reasoning, understanding, recognition | Multiple Choice | Human | 155 | Github Repo |
| VLM^2-Bench | Visual reasoning, understanding, recognition, and question answering | Answer Matching, Multiple Choice | Human | 3 | Website |
| VisualWebInstruct | Visual reasoning, understanding, recognition, and question answering | LLM Eval | Web | 900 | Website |

2.2. Benchmark Datasets, Simulators, and Generative Models for Embodied VLM

| Benchmark | Domain | Type | Project |
| --- | --- | --- | --- |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| WebArena | Web Agent | Simulator | Website, Github Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
| Genesis | Embodied AI | Generative Model, World Model | Github Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |

3. ⚒️ Post-Training

3.1. RL Alignment for VLM

| Title | Year | Paper | RL | Code |
| --- | --- | --- | --- | --- |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |

3.2. Finetuning for VLM

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |

3.3. VLM Alignment GitHub Repositories

| Project | Repository Link |
| --- | --- |
| LLaMAFactory | 🔗 GitHub |
| MM-Eureka-Zero | 🔗 GitHub |
| MM-RLHF | 🔗 GitHub |
| LMM-R1 | 🔗 GitHub |

4. ⚒️ Applications

4.1 Embodied VLM Agents

| Title | Year | Paper Link |
| --- | --- | --- |
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |

4.2. Generative Visual Media Applications

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌍 Website | 💾 Code |

4.3. Robotics and Embodied AI

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌍 Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌍 Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌍 Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| Integrated Task and Motion Planning | 2020 | 📄 Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌍 Website | - |
| Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌍 Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌍 Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌍 Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌍 Website | 💾 Code |

4.3.1. Manipulation

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌍 Website | - |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌍 Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌍 Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌍 Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Masked World Models for Visual Control | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |

4.3.2. Navigation

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | - |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | - |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌍 Website | - |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | 📄 Paper | 🌍 Website | - |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | - |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌍 Website | - |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | - |
| Navigation World Models | 2024 | 📄 Paper | 🌍 Website | - |

4.3.3. Human-robot Interaction

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌍 Website | - |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌍 Website | - |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | - |

4.3.4. Autonomous Driving

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | - |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌍 Website | - |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | - |
| Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | - |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | - |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | - |

4.4. Human-Centered AI

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application | 2024 | 📄 Paper | - | - |
| Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | - |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | - |
| Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | - |

4.4.1. Web Agent

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | - |
| CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code |

4.4.2. Accessibility

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| X-World: Accessibility, Vision, and Autonomy Meet | 2021 | 📄 Paper | - | - |
| Context-Aware Image Descriptions for Web Accessibility | 2024 | 📄 Paper | - | - |
| Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | 📄 Paper | - | - |

4.4.3. Healthcare

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | - |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | - |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code |

4.4.4. Social Goodness

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | - |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | - |
| Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | - |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | - |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images | 2024 | 📄 Paper | - | - |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | - |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | - |

5. Challenges

5.1 Hallucination

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | - |
| Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | - |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code |

5.2 Safety

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | - |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | - |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | - |
| Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | - |

5.3 Fairness

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Hallucination of Multimodal Large Language Models: A Survey | 2024 | 📄 Paper | - | - |
| Bias and Fairness in Large Language Models: A Survey | 2023 | 📄 Paper | - | - |
| Fairness and Bias in Multimodal AI: A Survey | 2024 | 📄 Paper | - | - |
| Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models | 2023 | 📄 Paper | - | - |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | 📄 Paper | - | - |
| FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | 📄 Paper | - | - |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | 📄 Paper | - | - |
| Benchmarking Vision Language Models for Cultural Understanding | 2024 | 📄 Paper | - | - |

5.4 Alignment

5.4.1 Multi-modality Alignment

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | 📄 Paper | - | - |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | 📄 Paper | - | - |
| Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Extending Multi-modal Contrastive Representations | 2023 | 📄 Paper | - | 💾 Code |
| OneLLM: One Framework to Align All Modalities with Language | 2023 | 📄 Paper | - | 💾 Code |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | 📄 Paper | 🌍 Website | 💾 Code |

5.4.2 Commonsense and Physics Alignment

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | 📄 Paper | 🌍 Website | - |
| WorldModelBench: Judging Video Generation Models As World Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | 📄 Paper | - | 💾 Code |
| Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | 📄 Paper | - | 💾 Code |
| Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | 📄 Paper | - | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Do generative video models understand physical principles? | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | 📄 Paper | 🌍 Website | 💾 Code |

5.5 Efficient Training and Fine-Tuning

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| VILA: On Pre-training for Visual Language Models | 2023 | 📄 Paper | - | - |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | 📄 Paper | - | - |
| LoRA: Low-Rank Adaptation of Large Language Models | 2021 | 📄 Paper | - | 💾 Code |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 📄 Paper | - | - |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | 📄 Paper | - | 💾 Code |
| RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | 📄 Paper | - | - |

5.6 Scarcity of High-quality Datasets

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | 📄 Paper | Website | 💾 Code |
| SLIP: Self-supervision meets Language-Image Pre-training | 2021 | 📄 Paper | - | 💾 Code |
| Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | 📄 Paper | - | - |
| Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | 📄 Paper | - | - |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | - | - |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | - |

6. Citation

@misc{li2025surveystateartlarge,
      title={A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges}, 
      author={Zongxia Li and Xiyang Wu and Hongyang Du and Huy Nghiem and Guangyao Shi},
      year={2025},
      eprint={2501.02189},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.02189}, 
}