multimodal-large-language-models
There are 333 repositories under the multimodal-large-language-models topic.
BradyFU/Awesome-Multimodal-Large-Language-Models
✨✨Latest Advances on Multimodal Large Language Models
X-PLUG/MobileAgent
Mobile-Agent: The Powerful GUI Agent Family
joanrod/star-vector
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
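Since SVG markup is plain text, image-to-SVG generation can be driven like any other vision-conditioned text generation. Below is a minimal sketch of that pattern assuming a Hugging Face-style processor/model interface; the checkpoint identifier and prompt prefix are illustrative assumptions, not StarVector's documented API.

```python
# Sketch: image-to-SVG as ordinary vision-conditioned text generation.
# The checkpoint id and "<svg" prefix are HYPOTHETICAL; see the repo for
# StarVector's actual supported loading and generation interface.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "starvector/starvector-8b-im2svg"  # assumed identifier
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("logo.png")
inputs = processor(images=image, text="<svg", return_tensors="pt")

# The model continues the "<svg" prefix token by token, emitting SVG markup.
output_ids = model.generate(**inputs, max_new_tokens=1024)
svg_code = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(svg_code)
```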
modelscope/ms-agent
MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios
ictnlp/LLaMA-Omni
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
VITA-MLLM/VITA
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
cambrian-mllm/cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
YangLing0818/RPG-DiffusionMaster
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
sherlockchou86/VideoPipe
A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)
ByteDance-Seed/Seed1.5-VL
Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 of 60 public benchmarks.
AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
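Most MLLMs bridge the two modalities by projecting vision-encoder features into the LLM's token-embedding space and concatenating them with text embeddings; Ovis's structural-alignment scheme refines this step. The sketch below shows only the generic projector pattern, not Ovis's specific mechanism.

```python
import torch
import torch.nn as nn

# Generic MLLM bridging pattern (NOT Ovis's specific design): project
# vision-encoder patch features into the LLM's text-embedding space,
# then feed them to the LLM as if they were ordinary token embeddings.
class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)

proj = VisionProjector(vision_dim=1024, llm_dim=4096)
vision_tokens = proj(torch.randn(1, 256, 1024))  # toy image features
text_tokens = torch.randn(1, 32, 4096)           # toy text embeddings
llm_input = torch.cat([vision_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```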
Henry-23/VideoChat
Real-time voice-interactive digital human, supporting both an end-to-end speech solution (GLM-4-Voice → THG) and a cascaded solution (ASR → LLM → TTS → THG). Appearance and voice are customizable without training, voice cloning is supported, and first-packet latency is as low as 3 s.
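A minimal sketch of one turn of the cascaded pipeline follows; the three stage functions are hypothetical stubs standing in for whatever ASR, LLM, and TTS backends are configured, and an end-to-end model such as GLM-4-Voice would collapse them into a single speech model.

```python
# One turn of the cascaded ASR -> LLM -> TTS loop. All three stage
# functions are HYPOTHETICAL stubs; replace each with a real backend.

def transcribe(audio_chunk: bytes) -> str:
    return "stub transcript"           # replace with a real ASR call

def chat(user_text: str) -> str:
    return f"echo: {user_text}"        # replace with a real LLM call

def synthesize(reply_text: str) -> bytes:
    return reply_text.encode("utf-8")  # replace with a real TTS call

def respond(audio_chunk: bytes) -> bytes:
    """Speech in, speech out. First-packet latency is the sum of the
    three stage latencies, which is why streaming each stage matters."""
    text_in = transcribe(audio_chunk)  # ASR: speech -> text
    text_out = chat(text_in)           # LLM: text -> reply text
    return synthesize(text_out)        # TTS: reply text -> speech

print(respond(b"\x00\x01"))
```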
BAAI-DCAI/Bunny
A family of lightweight multimodal models.
X-LANCE/SLAM-LLM
Speech, Language, Audio, Music Processing with Large Language Model
richard-peng-xia/awesome-multimodal-in-medical-imaging
A collection of resources on applications of multi-modal learning in medical imaging.
yaotingwangofficial/Awesome-MCoT
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
NVIDIA/audio-flamingo
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
AIDC-AI/Awesome-Unified-Multimodal-Models
Awesome Unified Multimodal Models
deepglint/unicom
Large-Scale Visual Representation Model
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
MME-Benchmarks/Video-MME
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
VITA-MLLM/Woodpecker
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
FoundationVision/Liquid
Liquid: Language Models are Scalable and Unified Multi-modal Generators
SkyworkAI/Vitron
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
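Such efficiency typically comes from shrinking the number of vision tokens the LLM must attend to (LLaVA-Mini compresses them to as few as one via modality pre-fusion). Below is a generic pooling-based sketch of the idea, not LLaVA-Mini's actual module.

```python
import torch
import torch.nn as nn

# Generic vision-token compression (NOT LLaVA-Mini's actual pre-fusion
# module): pool many patch tokens down to a handful before they reach
# the LLM, cutting the attention cost per image.
class TokenCompressor(nn.Module):
    def __init__(self, keep: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(keep)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, dim) -> (batch, keep, dim)
        return self.pool(vision_tokens.transpose(1, 2)).transpose(1, 2)

compress = TokenCompressor(keep=1)
print(compress(torch.randn(1, 576, 4096)).shape)  # torch.Size([1, 1, 4096])
```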
YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Coobiw/MPP-LLaVA
Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] inputs for {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on a 24 GB RTX 3090/4090.
hustvl/EVF-SAM
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Paranioar/Awesome_Matching_Pretraining_Transfering
A paper list on large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
jingyi0000/R1-VL
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
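At the core of GRPO is a group-relative advantage: sample several responses per prompt, score them, and normalize each reward against its group. R1-VL's step-wise variant extends this to rewards on individual reasoning steps; the sketch below covers only the simpler per-response case.

```python
import torch

# Group-relative advantage as used in GRPO-style training: rewards for a
# group of sampled responses to ONE prompt are normalized within the group.
# R1-VL's step-wise variant additionally scores individual reasoning steps;
# this sketch shows only per-response rewards.
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: (group_size,) scalar scores for one prompt's samples
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.0, 1.0, 1.0, 0.5])  # toy rewards for 4 samples
print(group_relative_advantages(rewards))
```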
AIDC-AI/Ovis-U1
A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
HenryHZY/Awesome-Multimodal-LLM
Research Trends in LLM-guided Multimodal Learning.
burglarhobbit/Awesome-Medical-Large-Language-Models
Curated papers on Large Language Models in Healthcare and Medical domain
baaivision/EVE
EVE Series: Encoder-Free Vision-Language Models from BAAI