mllm
There are 128 repositories under the mllm topic.
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
simular-ai/Agent-S
Agent S: an open agentic framework that uses computers like a human
X-PLUG/MobileAgent
Mobile-Agent: The Powerful GUI Agent Family
manycore-research/SpatialLM
SpatialLM: Training Large Language Models for Structured Indoor Modeling
ant-research/MagicQuill
[CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
NExT-GPT/NExT-GPT
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
atfortes/Awesome-LLM-Reasoning
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
SkyworkAI/Skywork-R1V
Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI (Kunlun Inc.), specializing in vision-language reasoning.
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
cambrian-mllm/cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
magic-research/Sa2VA
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
BAAI-DCAI/Bunny
A family of lightweight multimodal models.
NVlabs/EAGLE
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
CircleRadon/Osprey
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
taco-group/OpenEMMA
OpenEMMA: a permissively licensed, open-source "reproduction" of Waymo's EMMA model.
BradyFU/Woodpecker
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
FoundationVision/Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
SkyworkAI/Vitron
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
gokayfem/ComfyUI_VLM_nodes
Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Coobiw/MPP-LLaVA
Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] inputs and {sft/conversations} formats. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on an RTX 3090/4090 with 24GB.
ZebangCheng/Emotion-LLaMA
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
dvlab-research/LLMGA
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV2024 Oral
baaivision/EVE
EVE Series: Encoder-Free Vision-Language Models from BAAI
Atomic-man007/Awesome_Multimodel_LLM
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.
X-PLUG/Youku-mPLUG
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
VITA-MLLM/Long-VITA
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
CircleRadon/TokenPacker
[IJCV 2025] The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM"
DAMO-NLP-SG/VideoRefer
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
bz-lab/AUITestAgent
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
X-PLUG/mPLUG-2
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)
TIGER-AI-Lab/Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]
IDEA-Research/ChatRex
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
zhipeixu/FakeShield
🔥 [ICLR 2025] FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models
Yui010206/SeViLA
[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering