mllm
There are 101 repositories under mllm topic.
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
X-PLUG/MobileAgent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
magic-quill/MagicQuill
Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
atfortes/Awesome-LLM-Reasoning
Reasoning in Large Language Models: Papers and Resources, including Chain-of-Thought and OpenAI o1 🍓
X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
cambrian-mllm/cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
BAAI-DCAI/Bunny
A family of lightweight multimodal models.
CircleRadon/Osprey
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
simular-ai/Agent-S
Agent S: an open agentic framework that uses computers like a human
BradyFU/Woodpecker
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models. The first work to correct hallucinations in MLLMs.
FoundationVision/Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
NVlabs/EAGLE
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
dvlab-research/LLMGA
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV2024 Oral
SkyworkAI/Vitron
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
gokayfem/ComfyUI_VLM_nodes
Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Coobiw/MPP-LLaVA
Personal Project: MPP-Qwen14B & MPP-Qwen-Next(Multimodal Pipeline Parallel based on Qwen-LM). Support [video/image/multi-image] {sft/conversations}. Don't let the poverty limit your imagination! Train your own 8B/14B LLaVA-training-like MLLM on RTX3090/4090 24GB.
X-PLUG/Youku-mPLUG
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
Atomic-man007/Awesome_Multimodel_LLM
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancement.
baaivision/EVE
[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
CircleRadon/TokenPacker
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
X-PLUG/mPLUG-2
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)
TIGER-AI-Lab/Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024)
Yui010206/SeViLA
[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering
bz-lab/AUITestAgent
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
ZebangCheng/Emotion-LLaMA
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
FoundationVision/GenerateU
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection
sterzhang/image-textualization
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)
TideDra/VL-RLHF
A RLHF Infrastructure for Vision-Language Models
DAMO-NLP-SG/VideoRefer
The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
baaivision/DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
thu-ml/MMTrustEval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
IDEA-Research/ChatRex
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
lerogo/MMGenBench
Official repository of MMGenBench
zjrwtx/SFT-data-builder
利用免费的大模型api来结合你的私域数据来生成sft训练数据(妥妥白嫖)支持llamafactory等工具的训练数据格式synthetic data