Benchmarks and Evaluations, RL Alignment, Applications, and Challenges of Large Vision-Language Models
A frontier collection and survey of vision-language model (VLM) papers and model GitHub repositories.
Below we compile papers, models, and GitHub repositories covering:
- State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we keep adding new models and benchmarks).
- Benchmarks: VLM evaluation benchmarks with links to the corresponding works.
- Post-Training/Alignment: the latest work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, and beyond.
- Contributed surveys, perspectives, and datasets on the above topics.
Contributions and discussion are welcome!
🤩 Papers marked with a ✍️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
🔥 Post-Training/Alignment 🔥
- 4.1. Embodied VLM agents
- 4.2. Generative Visual Media Applications
- 4.3. Robotics and Embodied AI
- 4.3.1. Manipulation
- 4.3.2. Navigation
- 4.3.3. Human-robot Interaction
- 4.3.4. Autonomous Driving
- 4.4. Human-Centered AI
- 4.4.1. Web Agent
- 4.4.2. Accessibility
- 4.4.3. Healthcare
- 4.4.4. Social Goodness
- 5.1. Hallucination
- 5.2. Safety
- 5.3. Fairness
- 5.4. Alignment
- 5.4.1. Multi-modality Alignment
- 5.5. Efficient Training and Fine-Tuning
- 5.6. Scarcity of High-Quality Datasets
Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
---|---|---|---|---|---|---|
Qwen2.5-VL | 2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-LLM |
Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAM-B | DeepSeekMoE |
xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
InstructBLIP | 2023 | Encoder-decoder | CoCo, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
VisualBERT | 2019 | Encoder-only | COCO | 110M | Faster R-CNN | Pretrained from scratch |
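Most of the decoder-only models in the table above follow the same wiring: a pretrained vision encoder turns the image into patch features, a small projector maps those features into the language model's embedding space, and the decoder-only backbone attends over the concatenated visual and text tokens (encoder-decoder entries such as BLIP-2 or Flamingo instead fuse via a Q-Former or cross-attention). The sketch below is a minimal, hypothetical illustration of that pattern; the module names, dimensions, and the `inputs_embeds` call are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    """Toy decoder-only VLM: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder, language_model, vis_dim=1024, txt_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., a frozen CLIP/SigLIP-style ViT
        self.projector = nn.Sequential(        # maps patch features into the LLM token space
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )
        self.language_model = language_model   # decoder-only LLM backbone

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)            # (B, N_patches, vis_dim)
        visual_tokens = self.projector(patch_feats)                # (B, N_patches, txt_dim)
        # Prepend visual tokens to the text embeddings and decode autoregressively
        # (assumes a HuggingFace-style `inputs_embeds` interface on the backbone).
        fused = torch.cat([visual_tokens, text_embeds], dim=1)     # (B, N_patches + T, txt_dim)
        return self.language_model(inputs_embeds=fused)
```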
Benchmark Dataset | Domain | Metric Type | Source | Size (K) | Project |
---|---|---|---|---|---|
Inst-IT-Bench | Fine-grained Image and Video Understanding | Multiple Choice & LLM Eval | Human/Synthetic | 2K | Github Repo |
MovieChat | Video understanding | LLM Eval | Human | 1K | Github Repo |
PHYSBENCH | Physical world understanding | Multiple Choice | Graduate STEM Students | 100 | Github Repo |
MMTBench | Visual reasoning, understanding, recognition, and question answering | Multiple Choice | AI Experts | 30.1 | Github Repo |
MM-Vet | Optical Character Recognition (OCR) / Visual reasoning | LLM Eval | Human | 0.2 | Github Repo |
MM-En/CN | Multilingual multimodal understanding | Multiple Choice | Human | 3.2 | Github Repo |
GQA | Visual reasoning, understanding, recognition, and question answering | Answer Matching | Seed with Synthetic | 22,000 | Website |
VCR | Visual reasoning, understanding, recognition, and question answering | Multiple Choice | MTurks | 290 | Website |
VQAv2 | Visual reasoning, understanding, recognition, and question answering | Yes/No, Answer Matching | MTurks | 1,100 | Github Repo |
MMMU | Visual reasoning, understanding, recognition, and question answering | Answer Matching, Multiple Choice | College Students | 11.5 | Website |
TextVQA | Visual text understanding | Answer Matching | Expert Human | 28.6 | Github Repo |
DocVQA | Visual text understanding | Answer Matching | CrowdSource | 50 | Website |
MSCOCO-30K | Text-to-Image generation | BLEU, Rouge, Similarity | MTurks | 30 | Website |
ChartQA | Chart graphic understanding | Answer Matching | CrowdSource/Synthetic | 32.7 | Github Repo |
Perception-Test | Video understanding | Multiple Choice | CrowdSource | 11.6 | Github Repo |
MMLU | Multimodal general intelligence | Multiple Choice | Human | 15.9 | Github Repo |
MMStar | Multimodal general intelligence | Multiple Choice | Human | 1.5 | Website |
VideoMME | Video understanding | Multiple Choice | Experts | 2.7 | Website |
EgoSchema | Video understanding | Multiple Choice | Synthetic/Human | 5 | Website |
HallusionBench | Hallucination | Yes/No | Human | 1.13 | Github Repo |
POPE | Hallucination | Yes/No | Human | 9 | Github Repo |
CHAIR | Hallucination | Yes/No | Human | 124 | Github Repo |
MHalDetect | Hallucination | Answer Matching | Human | 4 | Github Repo |
Hallu-PI | Hallucination | Answer Matching | Human | 1.26 | Github Repo |
HallE-Control | Hallucination | Yes/No | Human | 108 | Github Repo |
AutoHallusion | Hallucination | Answer Matching | Synthetic | 3.129 | Github Repo |
BEAF | Hallucination | Yes/No | Human | 26 | Github Repo |
GAIVE | Hallucination | Answer Matching | Synthetic | 320 | Github Repo |
HalEval | Hallucination | Yes/No | CrowdSource/Synthetic | 2,000 | Github Repo |
AMBER | Hallucination | Answer Matching | Human | 15.22 | Github Repo |
GenAI-Bench | Text-to-Image generation | Human Ratings | Human | 80.0 | Huggingface |
NaturalBench | Multimodal general intelligence | Yes/No, Multiple Choice | Human | 10.0 | Huggingface |
R1-Onevision | Visual reasoning, understanding, recognition | Multiple Choice | Human | 155 | Github Repo |
VLM^2-Bench | Visual reasoning, understanding, recognition, and question answering | Answer Matching, Multiple Choice | Human | 3 | Website |
VisualWebInstruct | Visual reasoning, understanding, recognition, and question answering | LLM Eval | Web | 900 | Website |
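The "Metric Type" column above covers a handful of scoring schemes. As a rough orientation, here is a minimal, hypothetical scorer for the two most common ones, multiple choice and answer matching; real benchmarks add their own normalization rules, and "LLM Eval" entries instead delegate grading to a judge model.

```python
import re

def score_multiple_choice(predicted: str, correct_option: str) -> float:
    """Exact match on the option letter, e.g. 'B' vs 'B'."""
    return float(predicted.strip().upper().startswith(correct_option.strip().upper()))

def score_answer_matching(predicted: str, references: list[str]) -> float:
    """Loose string match: normalize whitespace/case and check equality or containment."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    pred = norm(predicted)
    return float(any(norm(ref) == pred or norm(ref) in pred for ref in references))

# Example: a VQA-style item scored by answer matching.
print(score_answer_matching("A red double-decker bus", ["red bus", "bus"]))  # 1.0
```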
Benchmark | Domain | Type | Project |
---|---|---|---|
Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
WebArena | Web Agent | Simulator | Website, Github Repo |
UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
Genesis | Embodied AI | Generative Model, World Model | Github Repo |
EMMOE | Embodied AI | Generative Model, World Model | Paper |
RoboGen | Embodied AI | Generative Model, World Model | Website |
Title | Year | Paper | RL | Code |
---|---|---|---|---|
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
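Several of the entries above align VLMs with Direct Preference Optimization (DPO) rather than on-policy RL. As a reference point, the sketch below computes the standard DPO objective from per-sequence log-probabilities; it is a generic illustration, not the implementation used by any of the listed papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO on summed sequence log-probs (one scalar per preference pair)."""
    # Log-ratios of the policy against the frozen reference model.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Maximize the margin between the chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
print(loss.item())
```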
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
Project | Repository Link |
---|---|
LLaMA-Factory | GitHub |
MM-Eureka-Zero | GitHub |
MM-RLHF | GitHub |
LMM-R1 | GitHub |
Title | Year | Paper Link |
---|---|---|
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
Title | Year | Paper | Website | Code |
---|---|---|---|---|
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌐 Website | -
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌐 Website | -
Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | -
Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌐 Website | -
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌐 Website | -
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | -
Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌐 Website | -
DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌐 Website | -
MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | -
Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌐 Website | -
Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌐 Website | -
Integrated Task and Motion Planning | 2020 | 📄 Paper | - | -
Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌐 Website | -
Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌐 Website | -
LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | -
Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌐 Website | -
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌐 Website | 💾 Code & Dataset
Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌐 Website | -
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌐 Website | -
Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌐 Website | 💾 Code
DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌐 Website | 💾 Code
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | -
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌐 Website | 💾 Code
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌐 Website | -
Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | -
Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌐 Website | -
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | -
RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌐 Website | -
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌐 Website | -
Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌐 Website | -
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌐 Website | -
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌐 Website | 💾 Code
Masked World Models for Visual Control | 2022 | 📄 Paper | 🌐 Website | 💾 Code
Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | -
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | -
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌐 Website | -
NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | 📄 Paper | 🌐 Website | -
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | -
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌐 Website | -
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | -
Navigation World Models | 2024 | 📄 Paper | 🌐 Website | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌐 Website | -
LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌐 Website | -
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | -
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌐 Website | -
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | -
Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code
MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | -
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌐 Website | -
VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | -
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration - A Robot Sous-Chef Application | 2024 | 📄 Paper | - | -
Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | -
Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | -
Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | -
CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code
ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code
ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
X-World: Accessibility, Vision, and Autonomy Meet | 2021 | 📄 Paper | - | -
Context-Aware Image Descriptions for Web Accessibility | 2024 | 📄 Paper | - | -
Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code
Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | -
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | -
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code
Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | -
Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | -
Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | -
A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | -
Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images | 2024 | 📄 Paper | - | -
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code
He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | -
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | -
Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code
Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | -
HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code
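Several of the works above build on object-level hallucination counting in the style of "Object Hallucination in Image Captioning" (commonly reported as CHAIR): CHAIR_i is the fraction of mentioned objects that do not appear in the image, and CHAIR_s is the fraction of captions containing at least one such object. The snippet below is a simplified sketch of that computation; real pipelines also map synonyms onto the ground-truth object vocabulary before counting.

```python
def chair_scores(captions_objects, ground_truth_objects):
    """captions_objects: list of object sets extracted from generated captions.
    ground_truth_objects: list of object sets actually present in each image."""
    hallucinated_mentions, total_mentions, hallucinated_captions = 0, 0, 0
    for mentioned, present in zip(captions_objects, ground_truth_objects):
        fake = mentioned - present                 # objects mentioned but not in the image
        hallucinated_mentions += len(fake)
        total_mentions += len(mentioned)
        hallucinated_captions += int(bool(fake))
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# Toy example: one clean caption and one caption hallucinating a "dog".
print(chair_scores([{"cat", "sofa"}, {"dog", "table"}],
                   [{"cat", "sofa"}, {"table", "chair"}]))  # (0.25, 0.5)
```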
Title | Year | Paper | Website | Code |
---|---|---|---|---|
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | -
Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | -
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | -
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | -
SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | -
Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | -
Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌐 Website | 💾 Code
Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Hallucination of Multimodal Large Language Models: A Survey | 2024 | 📄 Paper | - | -
Bias and Fairness in Large Language Models: A Survey | 2023 | 📄 Paper | - | -
Fairness and Bias in Multimodal AI: A Survey | 2024 | 📄 Paper | - | -
Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision-Language Models | 2023 | 📄 Paper | - | -
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | 📄 Paper | - | -
FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | 📄 Paper | - | -
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | 📄 Paper | - | -
Benchmarking Vision Language Models for Cultural Understanding | 2024 | 📄 Paper | - | -
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | 📄 Paper | - | -
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | 📄 Paper | - | -
Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | 📄 Paper | 🌐 Website | -
Extending Multi-modal Contrastive Representations | 2023 | 📄 Paper | - | 💾 Code
OneLLM: One Framework to Align All Modalities with Language | 2023 | 📄 Paper | - | 💾 Code
What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023 | 📄 Paper | 🌐 Website | 💾 Code
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | 📄 Paper | 🌐 Website | 💾 Code
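Much of the multi-modality alignment work above builds on contrastive image-text alignment of the CLIP variety, where paired image and text embeddings are pulled together and mismatched pairs pushed apart. For orientation, this is a minimal, generic sketch of the symmetric InfoNCE loss over a batch of paired embeddings; it is illustrative only and not taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (image_i, text_i) embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))         # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with a random batch of 8 paired 512-d embeddings.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```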
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | 📄 Paper | 🌐 Website | 💾 Code
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | 📄 Paper | 🌐 Website | 💾 Code
PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | 📄 Paper | 🌐 Website | 💾 Code
VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | 📄 Paper | 🌐 Website | 💾 Code
WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | 📄 Paper | 🌐 Website | -
WorldModelBench: Judging Video Generation Models As World Models | 2025 | 📄 Paper | 🌐 Website | 💾 Code
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | 📄 Paper | 🌐 Website | 💾 Code
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | 📄 Paper | - | 💾 Code
Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | 📄 Paper | - | 💾 Code
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | 📄 Paper | - | -
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌐 Website | 💾 Code
Do generative video models understand physical principles? | 2025 | 📄 Paper | 🌐 Website | 💾 Code
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | 📄 Paper | 🌐 Website | 💾 Code
How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | 📄 Paper | 🌐 Website | 💾 Code
Title | Year | Paper | Website | Code |
---|---|---|---|---|
VILA: On Pre-training for Visual Language Models | 2023 | 📄 Paper | - | -
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | 📄 Paper | - | -
LoRA: Low-Rank Adaptation of Large Language Models | 2021 | 📄 Paper | - | 💾 Code
QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 📄 Paper | - | -
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | 📄 Paper | - | 💾 Code
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | 📄 Paper | - | -
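LoRA, listed above, freezes the pretrained weight W and learns a low-rank update ΔW = BA, so the adapted layer computes Wx + (α/r)·BAx with far fewer trainable parameters. The snippet below is a minimal, self-contained sketch of that idea, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update (LoRA-style)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                 # freeze all pretrained parameters
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.01)   # A: small random init
        nn.init.zeros_(self.lora_B.weight)              # B: zeros, so training starts at the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Toy usage: wrap a 4096x4096 projection; only the two low-rank factors are trainable.
layer = LoRALinear(nn.Linear(4096, 4096))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 4096 = 65536
```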
Title | Year | Paper | Website | Code |
---|---|---|---|---|
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | 📄 Paper | Website | 💾 Code
SLIP: Self-supervision meets Language-Image Pre-training | 2021 | 📄 Paper | - | 💾 Code
Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | 📄 Paper | - | -
Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | 📄 Paper | - | -
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | - | -
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | -
@misc{li2025surveystateartlarge,
title={A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges},
author={Zongxia Li and Xiyang Wu and Hongyang Du and Huy Nghiem and Guangyao Shi},
year={2025},
eprint={2501.02189},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.02189},
}