mllm

There are 128 repositories under the mllm topic.

  • microsoft/unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Language: Python · Stars: 21k
  • X-PLUG/MobileAgent

    Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

    Language: Python · Stars: 3.9k
  • NExT-GPT/NExT-GPT

    Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

    Language: Python · Stars: 3.5k
  • ant-research/MagicQuill

    [CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System

    Language: Python · Stars: 3.2k
  • atfortes/Awesome-LLM-Reasoning

    Reasoning in LLMs: Papers and Resources, including Chain-of-Thought, OpenAI o1, and DeepSeek-R1 🍓

  • InternLM/InternLM-XComposer

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

    Language: Python · Stars: 2.8k
  • manycore-research/SpatialLM

    SpatialLM: Large Language Model for Spatial Understanding

    Language: Python · Stars: 2.4k
  • X-PLUG/mPLUG-DocOwl

    mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

    Language: Python · Stars: 2.1k
  • cambrian-mllm/cambrian

    Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

    Language: Python · Stars: 1.9k
  • simular-ai/Agent-S

    [ICLR 2025] Agent S: an open agentic framework that uses computers like a human

    Language: Python · Stars: 1.4k
  • SkyworkAI/Skywork-R1V

    Pioneering Multimodal Reasoning with CoT

    Language: Python · Stars: 1k
  • BAAI-DCAI/Bunny

    A family of lightweight multimodal models.

    Language: Python · Stars: 1k
  • magic-research/Sa2VA

    🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    Language: Python · Stars: 989
  • CircleRadon/Osprey

    [CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"

    Language: Python · Stars: 809
  • NVlabs/EAGLE

    Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs

    Language: Python · Stars: 637
  • BradyFU/Woodpecker

    ✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models

    Language: Python · Stars: 633
  • taco-group/OpenEMMA

    OpenEMMA, a permissively licensed open-source "reproduction" of Waymo's EMMA model.

    Language: Python · Stars: 581
  • FoundationVision/Groma

    [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization

    Language: Python · Stars: 555
  • SkyworkAI/Vitron

    NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

    Language: Python · Stars: 515
  • gokayfem/ComfyUI_VLM_nodes

    Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation

    Language: Python · Stars: 481
  • YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

    🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).

    Language: HTML · Stars: 444
  • Coobiw/MPP-LLaVA

    Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on an RTX 3090/4090 with 24 GB.

    Language: Jupyter Notebook · Stars: 430
  • dvlab-research/LLMGA

    This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV2024 Oral

    Language: Python · Stars: 391
  • baaivision/EVE

    EVE Series: Encoder-Free Vision-Language Models from BAAI

    Language: Python · Stars: 313
  • Atomic-man007/Awesome_Multimodel_LLM

    Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLMs). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.

  • X-PLUG/Youku-mPLUG

    Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks

    Language: Python · Stars: 294
  • VITA-MLLM/Long-VITA

    ✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

    Language: Python · Stars: 265
  • ZebangCheng/Emotion-LLaMA

    Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

    Language: Python · Stars: 252
  • CircleRadon/TokenPacker

    The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

    Language: Python · Stars: 241
  • X-PLUG/mPLUG-2

    mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)

    Language: Python · Stars: 223
  • TIGER-AI-Lab/Mantis

    Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR2024]

    Language: Python · Stars: 208
  • bz-lab/AUITestAgent

    AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of automating the entire process of GUI interaction and function verification.

  • Yui010206/SeViLA

    [NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering

    Language: Python · Stars: 187
  • DAMO-NLP-SG/VideoRefer

    [CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"

    Language: Python · Stars: 178
  • zhipeixu/FakeShield

    🔥 [ICLR 2025] FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

    Language: Python · Stars: 177
  • IDEA-Research/ChatRex

    Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

    Language: Python · Stars: 170