Efficient Multimodal Large Language Models: A Survey [arXiv]
Yizhang Jin1,2, Jian Li1, Yexin Liu3, Tianjun Gu4, Kai Wu1, Zhengkai Jiang1, Muyang He3, Bo Zhao3, Xin Tan4, Zhenye Gan1, Yabiao Wang1, Chengjie Wang1, Lizhuang Ma2
1Tencent YouTu Lab, 2Shanghai Jiao Tong University, 3Beijing Academy of Artificial Intelligence, 4East China Normal University
@misc{jin2024efficient,
title={Efficient Multimodal Large Language Models: A Survey},
author={Yizhang Jin and Jian Li and Yexin Liu and Tianjun Gu and Kai Wu and Zhengkai Jiang and Muyang He and Bo Zhao and Xin Tan and Zhenye Gan and Yabiao Wang and Chengjie Wang and Lizhuang Ma},
year={2024},
eprint={2405.10739},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning. However, their large model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the research state of efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions.
Model | Vision Encoder | Resolution | Vision Encoder Parameter Size | LLM | LLM Parameter Size | Vision-LLM Projector | Timeline |
---|---|---|---|---|---|---|---|
MobileVLM | CLIP ViT-L/14 | 336 | 0.3B | MobileLLaMA | 2.7B | LDP | 2023-12 |
LLaVA-Phi | CLIP ViT-L/14 | 336 | 0.3B | Phi-2 | 2.7B | MLP | 2024-01 |
Imp-v1 | SigLIP | 384 | 0.4B | Phi-2 | 2.7B | - | 2024-02 |
TinyLLaVA | SigLIP-SO | 384 | 0.4B | Phi-2 | 2.7B | MLP | 2024-02 |
Bunny | SigLIP-SO | 384 | 0.4B | Phi-2 | 2.7B | MLP | 2024-02 |
MobileVLM-v2-3B | CLIP ViT-L/14 | 336 | 0.3B | MobileLLaMA | 2.7B | LDPv2 | 2024-02 |
MoE-LLaVA-3.6B | CLIP-Large | 384 | - | Phi-2 | 2.7B | MLP | 2024-02 |
Cobra | DINOv2, SigLIP-SO | 384 | 0.3B+0.4B | Mamba-2.8b-Zephyr | 2.8B | MLP | 2024-03 |
Mini-Gemini | CLIP-Large | 336 | - | Gemma | 2B | MLP | 2024-03 |
Vary-toy | CLIP | 224 | - | Qwen | 1.8B | - | 2024-01 |
TinyGPT-V | EVA | 224/448 | - | Phi-2 | 2.7B | Q-Former | 2024-01 |
SPHINX-Tiny | DINOv2, CLIP-ConvNeXt | 448 | - | TinyLlama | 1.1B | - | 2024-02 |
ALLaVA-Longer | CLIP-ViT-L/14 | 336 | 0.3B | Phi-2 | 2.7B | - | 2024-02 |
MM1-3B-MoE-Chat | CLIP_DFN-ViT-H | 378 | - | - | 3B | C-Abstractor | 2024-03 |
LLaVA-Gemma | DINOv2 | - | - | Gemma-2b-it | 2B | - | 2024-03 |
Mipha-3B | SigLIP | 384 | - | Phi-2 | 2.7B | - | 2024-03 |
VL-Mamba | SigLIP-SO | 384 | - | Mamba-2.8B-Slimpj | 2.8B | VSS-L2 | 2024-03 |
MiniCPM-V 2.0 | SigLIP | - | 0.4B | MiniCPM | 2.7B | Perceiver Resampler | 2024-03 |
DeepSeek-VL | SigLIP-L | 384 | 0.4B | DeepSeek-LLM | 1.3B | MLP | 2024-03 |
KarmaVLM | SigLIP-SO | 384 | 0.4B | Qwen1.5 | 0.5B | - | 2024-02 |
moondream2 | SigLIP | - | - | Phi-1.5 | 1.3B | - | 2024-03 |
Bunny-v1.1-4B | SigLIP | 1152 | - | Phi-3-Mini-4K | 3.8B | - | 2024-02 |
⚡We will actively maintain this repository and incorporate new research as it emerges. If you have any questions, please feel free to contact swordli@tencent.com.
- Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv, 2023 [Paper]
- Llava-phi: Efficient multi-modal assistant with small language model. arXiv, 2024 [Paper]
- Imp-v1: An empirical study of multimodal small language models. arXiv, 2024 [Paper]
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv, 2024 [Paper]
- (Bunny) Efficient multimodal learning from data-centric perspective. arXiv, 2024 [Paper]
- Gemini: a family of highly capable multimodal models. arXiv, 2023 [Paper]
- Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv, 2024 [Paper]
- Moe-llava: Mixture of experts for large vision-language models. arXiv, 2024 [Paper]
- Cobra: Extending mamba to multi-modal large language model for efficient inference. arXiv, 2024 [Paper]
- Mini-gemini: Mining the potential of multi-modality vision language models. arXiv, 2024 [Paper]
- (Vary-toy) Small language model meets with reinforced vision vocabulary. arXiv, 2024 [Paper]
- Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv, 2023 [Paper]
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv, 2024 [Paper]
- ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model. arXiv, 2024 [Paper]
- Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv, 2024 [Paper]
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model. arXiv, 2024 [Paper]
- Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models. arXiv, 2024 [Paper]
- VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv, 2024 [Paper]
- MiniCPM-V 2.0: An Efficient End-side MLLM with Strong OCR and Understanding Capabilities. GitHub, 2024 [GitHub]
- DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv, 2024 [Paper]
- KarmaVLM: A family of high efficiency and powerful visual language model. GitHub, 2024 [GitHub]
- moondream: tiny vision language model. GitHub, 2024 [GitHub]
- Broadening the visual encoding of vision-language models. arXiv, 2024 [Paper]
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. arXiv, 2024 [Paper]
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv, 2024 [Paper]
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era. arXiv, 2024 [Paper]
- Visual Instruction Tuning. arXiv, 2023 [Paper]
- Improved baselines with visual instruction tuning. arXiv, 2023 [Paper]
- Flamingo: a Visual Language Model for Few-Shot Learning. arXiv, 2022 [Paper]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023 [Paper]
- Broadening the visual encoding of vision-language models. arXiv, 2024 [Paper]
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv, 2024 [Paper]
- Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv, 2023 [Paper]
- Vl-mamba: Exploring state space models for multimodal learning. arXiv, 2024 [Paper]
- Honeybee: Locality-enhanced projector for multimodal llm. arXiv, 2023 [Paper]
- Llama: Open and efficient foundation language models. arXiv, 2023 [Paper]
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. Website, 2023 [[web](https://vicuna.lmsys.org)]
- Phi-2: The surprising power of small language models. Blog, 2023 [[blog](Microsoft Research Blog)]
- Gemma: Open models based on gemini research and technology. arXiv, 2024 [Paper]
- Phi-3 technical report: A highly capable language model locally on your phone. 2024
- Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv, 2024 [Paper]
- A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv, 2024 [Paper]
- Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv, 2024 [Paper]
- Texthawk: Exploring efficient fine-grained perception of multimodal large language models. arXiv, 2024 [Paper]
- TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv, 2024
- Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv, 2024 [Paper]
- Madtp: Multi-modal alignment-guided dynamic token pruning for accelerating vision-language transformer. arXiv, 2024 [Paper]
- CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers. ICML, 2024 [Paper]
- Matryoshka Query Transformer for Large Vision-Language Models. arXiv, 2024 [Paper]
- Mini-gemini: Mining the potential of multi-modality vision language models. arXiv, 2024 [Paper]
- When do we not need larger vision models? arXiv, 2024 [Paper]
- Plug-and-play grounding of reasoning in multimodal large language models. arXiv, 2024 [Paper]
- Mova: Adapting mixture of vision experts to multimodal context. arXiv, 2024 [Paper]
- Elysium: Exploring object-level perception in videos via mllm. arXiv, 2024 [Paper]
- Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv, 2023 [Paper]
- Video-llava: Learning united visual representation by alignment before projection. arXiv, 2023 [Paper]
- Moe-llava: Mixture of experts for large vision-language models. arXiv, 2024 [Paper]
- Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv, 2024 [Paper]
- Mixtral of experts. arXiv, 2024 [Paper]
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. arXiv, 2024 [Paper]
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv, 2023 [Paper]
- Vl-mamba: Exploring state space models for multimodal learning. arXiv, 2024 [Paper]
- On speculative decoding for multimodal large language models. arXiv, 2024 [Paper]
- An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. arXiv, 2024 [Paper]
- Boosting multimodal large language models with visual tokens withdrawal for rapid inference. arXiv, 2024 [Paper]
- Tinyllava: A framework of small-scale large multimodal models. arXiv, 2024 [Paper]
- Vila: On pre-training for visual language models. arXiv, 2023 [Paper]
- Sharegpt4v: Improving large multi-modal models with better captions. arXiv, 2023 [Paper]
- What matters when building vision-language models? arXiv, 2024 [Paper]
- Cheap and quick: Efficient vision-language instruction tuning for large language models. NeurIPS, 2023 [Paper]
- Hyperllava: Dynamic visual and language expert tuning for multimodal large language models. arXiv, 2024 [Paper]
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv, 2024 [Paper]
- Cobra: Extending mamba to multi-modal large language model for efficient inference. arXiv, 2024 [Paper]
- Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv, 2023 [Paper]
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models. arXiv, 2024 [Paper]
- Memory-space visual prompting for efficient vision-language fine-tuning. arXiv, 2024 [Paper]
- Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging. arXiv, 2024 [Paper]
- Moe-tinymed: Mixture of experts for tiny medical large vision-language models. arXiv, 2024 [Paper]
- Texthawk: Exploring efficient fine-grained perception of multimodal large language models. arXiv, 2024 [Paper]
- TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv, 2024 [Paper]
- Monkey: Image resolution and text label are important things for large multi-modal models. arXiv, 2024 [Paper]
- Hrvda: High-resolution visual document assistant. arXiv, 2023 [Paper]
- mplug-2: A modularized multi-modal foundation model across text, image and video. arXiv, 2023 [Paper]
- Video-llava: Learning united visual representation by alignment before projection. arXiv, 2023 [Paper]
- Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. arXiv, 2024 [Paper]
- Llama-vid: An image is worth 2 tokens in large language models. arXiv, 2023 [Paper]