Benchmarks and Evaluations, RL Alignment, Applications, and Challenges of Large Vision-Language Models
A frontier collection and survey of vision-language model (VLM) papers and model GitHub repositories.
Below we compile papers, models, and GitHub repositories covering:
- State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we keep adding new models and benchmarks).
- Benchmarks: VLM evaluation benchmarks with links to the corresponding works.
- Post-Training/Alignment: the latest work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, and beyond.
- Contributed surveys, perspectives, and datasets on the above topics.
Contributions and discussion are welcome!
🤩 Papers marked with a ✍️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
🔥 Post-Training/Alignment 🔥
- 4.1. Embodied VLM agents
- 4.2. Generative Visual Media Applications
- 4.3. Robotics and Embodied AI
- 4.3.1. Manipulation
- 4.3.2. Navigation
- 4.3.3. Human-robot Interaction
- 4.3.4. Autonomous Driving
- 4.4. Human-Centered AI
- 4.4.1. Web Agent
- 4.4.2. Accessibility
- 4.4.3. Healthcare
- 4.4.4. Social Goodness
- 5.1. Hallucination
- 5.2. Safety
- 5.3. Fairness
- 5.4. Alignment
- 5.4.1. Multi-modality Alignment
- 5.5. Efficient Training and Fine-Tuning
- 5.6. Scarcity of High-Quality Datasets
Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
---|---|---|---|---|---|---|
Qwen2.5-VL | 2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-LLM |
Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAM-B | DeepSeekMoE |
xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
InstructBLIP | 2023 | Encoder-decoder | CoCo, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
VisualBERT | 2019 | Encoder-only | COCO | 110M | Faster R-CNN | Pretrained from scratch |
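Most of the decoder-only models in the table above follow the same wiring: a pretrained vision encoder turns the image into patch features, a small projector maps those features into the language model's embedding space, and the decoder-only backbone attends over the concatenated visual and text tokens (encoder-decoder entries such as BLIP-2 or Flamingo instead fuse via a Q-Former or cross-attention). The sketch below is a minimal, hypothetical illustration of that pattern; the module names, dimensions, and the `inputs_embeds` call are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    """Toy decoder-only VLM: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder, language_model, vis_dim=1024, txt_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., a frozen CLIP/SigLIP-style ViT
        self.projector = nn.Sequential(        # maps patch features into the LLM token space
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )
        self.language_model = language_model   # decoder-only LLM backbone

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)            # (B, N_patches, vis_dim)
        visual_tokens = self.projector(patch_feats)                # (B, N_patches, txt_dim)
        # Prepend visual tokens to the text embeddings and decode autoregressively
        # (assumes a HuggingFace-style `inputs_embeds` interface on the backbone).
        fused = torch.cat([visual_tokens, text_embeds], dim=1)     # (B, N_patches + T, txt_dim)
        return self.language_model(inputs_embeds=fused)
```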
Benchmark Dataset | Domain | Metric Type | Source | Size (K) | Project |
---|---|---|---|---|---|
Inst-IT-Bench | Fine-grained Image and Video Understanding | Multiple Choice & LLM Eval | Human/Synthetic | 2K | Github Repo |
MovieChat | Video understanding | LLM Eval | Human | 1K | Github Repo |
PHYSBENCH | Physical world understanding | Multiple Choice | Graduate STEM Students | 100 | Github Repo |
MMTBench | Visual reasoning, understanding, recognition, and question answering | Multiple Choice | AI Experts | 30.1 | Github Repo |
MM-Vet | Optical Character Recognition (OCR) / Visual reasoning | LLM Eval | Human | 0.2 | Github Repo |
MM-En/CN | Multilingual multimodal understanding | Multiple Choice | Human | 3.2 | Github Repo |
GQA | Visual reasoning, understanding, recognition, and question answering | Answer Matching | Seed with Synthetic | 22,000 | Website |
VCR | Visual reasoning, understanding, recognition, and question answering | Multiple Choice | MTurks | 290 | Website |
VQAv2 | Visual reasoning, understanding, recognition, and question answering | Yes/No, Answer Matching | MTurks | 1,100 | Github Repo |
MMMU | Visual reasoning, understanding, recognition, and question answering | Answer Matching, Multiple Choice | College Students | 11.5 | Website |
TextVQA | Visual text understanding | Answer Matching | Expert Human | 28.6 | Github Repo |
DocVQA | Visual text understanding | Answer Matching | CrowdSource | 50 | Website |
MSCOCO-30K | Text-to-Image generation | BLEU, Rouge, Similarity | MTurks | 30 | Website |
ChartQA | Chart graphic understanding | Answer Matching | CrowdSource/Synthetic | 32.7 | Github Repo |
Perception-Test | Video understanding | Multiple Choice | CrowdSource | 11.6 | Github Repo |
MMLU | Multimodal general intelligence | Multiple Choice | Human | 15.9 | Github Repo |
MMStar | Multimodal general intelligence | Multiple Choice | Human | 1.5 | Website |
VideoMME | Video understanding | Multiple Choice | Experts | 2.7 | Website |
EgoSchema | Video understanding | Multiple Choice | Synthetic/Human | 5 | Website |
HallusionBench | Hallucination | Yes/No | Human | 1.13 | Github Repo |
POPE | Hallucination | Yes/No | Human | 9 | Github Repo |
CHAIR | Hallucination | Yes/No | Human | 124 | Github Repo |
MHalDetect | Hallucination | Answer Matching | Human | 4 | Github Repo |
Hallu-PI | Hallucination | Answer Matching | Human | 1.26 | Github Repo |
HallE-Control | Hallucination | Yes/No | Human | 108 | Github Repo |
AutoHallusion | Hallucination | Answer Matching | Synthetic | 3.129 | Github Repo |
BEAF | Hallucination | Yes/No | Human | 26 | Github Repo |
GAIVE | Hallucination | Answer Matching | Synthetic | 320 | Github Repo |
HalEval | Hallucination | Yes/No | CrowdSource/Synthetic | 2,000 | Github Repo |
AMBER | Hallucination | Answer Matching | Human | 15.22 | Github Repo |
GenAI-Bench | Text-to-Image generation | Human Ratings | Human | 80.0 | Huggingface |
NaturalBench | Multimodal general intelligence | Yes/No, Multiple Choice | Human | 10.0 | Huggingface |
R1-Onevision | Visual reasoning, understanding, recognition | Multiple Choice | Human | 155 | Github Repo |
VLM^2-Bench | Visual reasoning, understanding, recognition, and question answering | Answer Matching, Multiple Choice | Human | 3 | Website |
VisualWebInstruct | Visual reasoning, understanding, recognition, and question answering | LLM Eval | Web | 900 | Website |
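The "Metric Type" column above covers a handful of scoring schemes. As a rough orientation, here is a minimal, hypothetical scorer for the two most common ones, multiple choice and answer matching; real benchmarks add their own normalization rules, and "LLM Eval" entries instead delegate grading to a judge model.

```python
import re

def score_multiple_choice(predicted: str, correct_option: str) -> float:
    """Exact match on the option letter, e.g. 'B' vs 'B'."""
    return float(predicted.strip().upper().startswith(correct_option.strip().upper()))

def score_answer_matching(predicted: str, references: list[str]) -> float:
    """Loose string match: normalize whitespace/case and check equality or containment."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    pred = norm(predicted)
    return float(any(norm(ref) == pred or norm(ref) in pred for ref in references))

# Example: a VQA-style item scored by answer matching.
print(score_answer_matching("A red double-decker bus", ["red bus", "bus"]))  # 1.0
```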
Benchmark | Domain | Type | Project |
---|---|---|---|
Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
WebArena | Web Agent | Simulator | Website, Github Repo |
UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
Genesis | Embodied AI | Generative Model, World Model | Github Repo |
EMMOE | Embodied AI | Generative Model, World Model | Paper |
RoboGen | Embodied AI | Generative Model, World Model | Website |
Title | Year | Paper | RL | Code |
---|---|---|---|---|
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
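Several of the entries above align VLMs with Direct Preference Optimization (DPO) rather than on-policy RL. As a reference point, the sketch below computes the standard DPO objective from per-sequence log-probabilities; it is a generic illustration, not the implementation used by any of the listed papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO on summed sequence log-probs (one scalar per preference pair)."""
    # Log-ratios of the policy against the frozen reference model.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Maximize the margin between the chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
print(loss.item())
```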
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
Project | Repository Link |
---|---|
LLaMA-Factory | GitHub |
MM-Eureka-Zero | GitHub |
MM-RLHF | GitHub |
LMM-R1 | GitHub |
Title | Year | Paper Link |
---|---|---|
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
Title | Year | Paper | Website | Code |
---|---|---|---|---|
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌐 Website | -
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌐 Website | -
Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | -
Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌐 Website | -
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌐 Website | -
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | -
Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌐 Website | -
DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌐 Website | -
MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | -
Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌐 Website | -
Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌐 Website | -
Integrated Task and Motion Planning | 2020 | 📄 Paper | - | -
Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌐 Website | -
Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌐 Website | -
LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | -
Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌐 Website | -
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌐 Website | 💾 Code & Dataset
Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌐 Website | -
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌐 Website | -
Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌐 Website | 💾 Code
DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌐 Website | 💾 Code
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | -
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌐 Website | 💾 Code
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌐 Website | -
Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | -
Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌐 Website | -
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | -
RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌐 Website | -
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌐 Website | -
Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌐 Website | -
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌐 Website | -
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌐 Website | 💾 Code
Masked World Models for Visual Control | 2022 | 📄 Paper | 🌐 Website | 💾 Code
Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | -
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | -
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌐 Website | -
NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | 📄 Paper | 🌐 Website | -
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | -
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌐 Website | -
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | -
Navigation World Models | 2024 | 📄 Paper | 🌐 Website | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌐 Website | -
LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌐 Website | -
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | -
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌐 Website | -
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | -
Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code
MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | -
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌐 Website | -
VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | -
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration - A Robot Sous-Chef Application | 2024 | 📄 Paper | - | -
Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | -
Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | -
Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | -
CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code
ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code
ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
X-World: Accessibility, Vision, and Autonomy Meet | 2021 | 📄 Paper | - | -
Context-Aware Image Descriptions for Web Accessibility | 2024 | 📄 Paper | - | -
Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code
Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | -
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | -
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code
Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | -
Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | -
Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | -
A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | -
Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images | 2024 | 📄 Paper | - | -
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code
He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | -
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | -
Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code
Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | -
HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code
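Several of the works above build on object-level hallucination counting in the style of "Object Hallucination in Image Captioning" (commonly reported as CHAIR): CHAIR_i is the fraction of mentioned objects that do not appear in the image, and CHAIR_s is the fraction of captions containing at least one such object. The snippet below is a simplified sketch of that computation; real pipelines also map synonyms onto the ground-truth object vocabulary before counting.

```python
def chair_scores(captions_objects, ground_truth_objects):
    """captions_objects: list of object sets extracted from generated captions.
    ground_truth_objects: list of object sets actually present in each image."""
    hallucinated_mentions, total_mentions, hallucinated_captions = 0, 0, 0
    for mentioned, present in zip(captions_objects, ground_truth_objects):
        fake = mentioned - present                 # objects mentioned but not in the image
        hallucinated_mentions += len(fake)
        total_mentions += len(mentioned)
        hallucinated_captions += int(bool(fake))
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# Toy example: one clean caption and one caption hallucinating a "dog".
print(chair_scores([{"cat", "sofa"}, {"dog", "table"}],
                   [{"cat", "sofa"}, {"table", "chair"}]))  # (0.25, 0.5)
```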
Title | Year | Paper | Website | Code |
---|---|---|---|---|
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | -
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | -
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | -
SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | -
Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | -
Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌐 Website | 💾 Code
Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Hallucination of Multimodal Large Language Models: A Survey | 2024 | 📄 Paper | - | -
Bias and Fairness in Large Language Models: A Survey | 2023 | 📄 Paper | - | -
Fairness and Bias in Multimodal AI: A Survey | 2024 | 📄 Paper | - | -
Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision-Language Models | 2023 | 📄 Paper | - | -
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | 📄 Paper | - | -
FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | 📄 Paper | - | -
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | 📄 Paper | - | -
Benchmarking Vision Language Models for Cultural Understanding | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | 📄 Paper | - | -
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | 📄 Paper | - | -
Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | 📄 Paper | 🌐 Website | -
Extending Multi-modal Contrastive Representations | 2023 | 📄 Paper | - | 💾 Code
OneLLM: One Framework to Align All Modalities with Language | 2023 | 📄 Paper | - | 💾 Code
What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023 | 📄 Paper | 🌐 Website | 💾 Code
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | 📄 Paper | 🌐 Website | 💾 Code
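Much of the multi-modality alignment work above builds on contrastive image-text alignment of the CLIP variety, where paired image and text embeddings are pulled together and mismatched pairs pushed apart. For orientation, this is a minimal, generic sketch of the symmetric InfoNCE loss over a batch of paired embeddings; it is illustrative only and not taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (image_i, text_i) embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))         # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with a random batch of 8 paired 512-d embeddings.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```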
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | 📄 Paper | 🌐 Website | 💾 Code
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | 📄 Paper | 🌐 Website | 💾 Code
PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | 📄 Paper | 🌐 Website | 💾 Code
VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | 📄 Paper | 🌐 Website | 💾 Code
WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | 📄 Paper | 🌐 Website | -
WorldModelBench: Judging Video Generation Models As World Models | 2025 | 📄 Paper | 🌐 Website | 💾 Code
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | 📄 Paper | 🌐 Website | 💾 Code
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | 📄 Paper | - | 💾 Code
Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | 📄 Paper | - | 💾 Code
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | 📄 Paper | - | -
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌐 Website | 💾 Code
Do generative video models understand physical principles? | 2025 | 📄 Paper | 🌐 Website | 💾 Code
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | 📄 Paper | 🌐 Website | 💾 Code
How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VILA: On Pre-training for Visual Language Models | 2023 | 📄 Paper | - | -
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | 📄 Paper | - | -
LoRA: Low-Rank Adaptation of Large Language Models | 2021 | 📄 Paper | - | 💾 Code
QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 📄 Paper | - | -
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | 📄 Paper | - | 💾 Code
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | 📄 Paper | - | -
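LoRA, listed above, freezes the pretrained weight W and learns a low-rank update ΔW = BA, so the adapted layer computes Wx + (α/r)·BAx with far fewer trainable parameters. The snippet below is a minimal, self-contained sketch of that idea, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update (LoRA-style)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                 # freeze all pretrained parameters
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.01)   # A: small random init
        nn.init.zeros_(self.lora_B.weight)              # B: zeros, so training starts at the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Toy usage: wrap a 4096x4096 projection; only the two low-rank factors are trainable.
layer = LoRALinear(nn.Linear(4096, 4096))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 4096 = 65536
```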
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | 📄 Paper | Website | 💾 Code
SLIP: Self-supervision meets Language-Image Pre-training | 2021 | 📄 Paper | - | 💾 Code
Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | 📄 Paper | - | -
Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | 📄 Paper | - | -
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | - | -
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | -
@misc{li2025surveystateartlarge,
title={A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges},
author={Zongxia Li and Xiyang Wu and Hongyang Du and Huy Nghiem and Guangyao Shi},
year={2025},
eprint={2501.02189},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.02189},
}