🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper [a new version will be released soon]
The first survey for Multimodal Large Language Models (MLLMs). ✨
Welcome to add WeChat ID (wmd_ustc) to join our MLLM communication group! 🌟
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper
The first comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 50+ advanced models, such as Qwen-VL-Max, Gemini Pro, and GPT-4V. ✨
If you want to add your model to our leaderboards, please feel free to email bradyfu24@gmail.com. We will update the leaderboards promptly. ✨
Download MME 🌟🌟
The benchmark dataset is collected by Xiamen University for academic research only. To obtain the dataset, please email yongdongluo@stu.xmu.edu.cn, subject to the following requirements.
Requirements: Real names are encouraged for better academic communication. Your email suffix should match your affiliation, e.g., xx@stu.xmu.edu.cn for Xiamen University; otherwise, please explain why. Please include the information below in your application email.
Name: (your full name)
Affiliation: (the name/URL of your university or company)
Job Title: (e.g., professor, PhD student, or researcher)
Email: (your email address)
How to use: (non-commercial use only)
🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | Source Code
The first work to correct hallucinations in MLLMs. ✨
🔥🔥🔥 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Paper
The first technical report for Gemini vs GPT-4V. A total of 128 pages. Completed within one week of the Gemini API opening. 🌟
📑 If you find our projects helpful to your research, please consider citing:
@article{yin2023survey,
title={A Survey on Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
journal={arXiv preprint arXiv:2306.13549},
year={2023}
}
@article{fu2023mme,
title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and Wu, Yunsheng and Ji, Rongrong},
journal={arXiv preprint arXiv:2306.13394},
year={2023}
}
@article{yin2023woodpecker,
title={Woodpecker: Hallucination Correction for Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Xu, Tong and Wang, Hao and Sui, Dianbo and Shen, Yunhang and Li, Ke and Sun, Xing and Chen, Enhong},
journal={arXiv preprint arXiv:2310.16045},
year={2023}
}
@article{fu2023gemini,
title={A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise},
author={Fu, Chaoyou and Zhang, Renrui and Wang, Zihan and Huang, Yubo and Zhang, Zhengye and Qiu, Longtian and Ye, Gaoxiang and Shen, Yunhang and Zhang, Mengdan and Chen, Peixian and Zhao, Sirui and Lin, Shaohui and Jiang, Deqiang and Yin, Di and Gao, Peng and Li, Ke and Li, Hongsheng and Sun, Xing},
journal={arXiv preprint arXiv:2312.12436},
year={2023}
}
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | Github | - |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | - |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |
Name | Paper | Link | Notes |
---|---|---|---|
ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT4V |
IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination |
CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns. |
StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |
SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing hallucination issue |
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | Multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
Name | Paper | Link | Notes |
---|---|---|---|
MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
Name | Paper | Link | Notes |
---|---|---|---|
EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
Name | Paper | Link | Notes |
---|---|---|---|
VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
Name | Paper | Link | Notes |
---|---|---|---|
TempCompass | TempCompass: Do Video LLMs Really Understand Videos? | Link | A benchmark to evaluate the temporal perception ability of Video LLMs |
VQAv2-IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | A benchmark for assessing "I Know" visual hallucination |
Math-Vision | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Link | A diverse mathematical reasoning benchmark |
CMMMU | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | Link | A Chinese benchmark involving reasoning and knowledge across multiple disciplines |
MMCBench | Benchmarking Large Multimodal Models against Common Corruptions | Link | A benchmark for examining self-consistency under common corruptions |
MMVP | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Link | A benchmark for assessing visual capabilities |
TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Link | A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks. |
ViP-Bench | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A benchmark for visual prompts |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A 3D-centric benchmark |
Video-Bench | Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | Link | A benchmark for video-MLLM evaluation |
MLLM-Bench | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | Link | GPT-4V evaluation with per-sample criteria |
BenchLMM | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Link | A benchmark for assessment of the robustness against different image styles |
MMC-Benchmark | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Link | A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Link | A comprehensive multimodal benchmark for video understanding |
Bingo | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | Link | A benchmark for hallucination evaluation that focuses on two common types |
MagnifierBench | OtterHD: A High-Resolution Multi-modality Model | Link | A benchmark designed to probe models' ability of fine-grained perception |
HallusionBench | HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | Link | An image-context reasoning benchmark for evaluation of hallucination |
PCA-EVAL | Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | Link | A benchmark for evaluating multi-domain embodied decision-making. |
MMHal-Bench | Aligning Large Multimodal Models with Factually Augmented RLHF | Link | A benchmark for hallucination evaluation |
MathVista | MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | Link | A benchmark that challenges both visual and math reasoning capabilities |
SparklesEval | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria. |
ISEKAI | Link-Context Learning for Multimodal LLMs | Link | A benchmark comprising exclusively unseen generated image-label pairs, designed for link-context learning |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
I4 | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | Link | A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions |
SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Link | A large-scale chart-visual question-answering dataset |
MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Link | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Link | A benchmark for evaluation of generative comprehension in MLLMs |
MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | Link | A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models |
Lynx | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Link | A comprehensive evaluation benchmark including both image and video tasks |
GAVIE | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A benchmark to evaluate the hallucination and instruction following ability |
MME | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Link | A comprehensive MLLM Evaluation benchmark |
LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Link | An evaluation platform for MLLMs |
LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks |
M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Link | A multilingual, multimodal, multilevel benchmark for evaluating MLLM |
OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Link | Dataset for evaluation on multiple capabilities |
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset focusing on recognizing visual entities from Wikipedia in images in the wild |