multimodal-large-language-models

There are 333 repositories under the multimodal-large-language-models topic.

  • BradyFU/Awesome-Multimodal-Large-Language-Models

    ✨✨ Latest Advances on Multimodal Large Language Models

  • X-PLUG/MobileAgent

    Mobile-Agent: The Powerful GUI Agent Family

    Language: Python · ⭐ 5.6k
  • joanrod/star-vector

    StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

    Language: Python · ⭐ 4k
  • modelscope/ms-agent

    MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios

    Language: Python · ⭐ 3.4k
  • ictnlp/LLaMA-Omni

    LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

    Language: Python · ⭐ 3.1k
  • VITA-MLLM/VITA

    ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    Language: Python · ⭐ 2.4k
  • X-PLUG/mPLUG-DocOwl

    mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

    Language: Python · ⭐ 2.2k
  • cambrian-mllm/cambrian

    Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

    Language: Python · ⭐ 2k
  • YangLing0818/RPG-DiffusionMaster

    [ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

    Language: Jupyter Notebook · ⭐ 1.8k
  • sherlockchou86/VideoPipe

    A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

    Language: C++ · ⭐ 1.8k
  • ByteDance-Seed/Seed1.5-VL

    Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 of 60 public benchmarks.

    Language: Jupyter Notebook · ⭐ 1.4k
  • AIDC-AI/Ovis

    A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.

    Language: Python · ⭐ 1.3k
  • Henry-23/VideoChat

    A real-time voice-interactive digital human, supporting both end-to-end speech pipelines (GLM-4-Voice → THG) and cascaded pipelines (ASR → LLM → TTS → THG). Customizable appearance and voice with no training required; supports voice cloning, with first-packet latency as low as 3 s.

    Language: Python · ⭐ 1.1k
  • BAAI-DCAI/Bunny

    A family of lightweight multimodal models.

    Language: Python · ⭐ 1k
  • X-LANCE/SLAM-LLM

    Speech, Language, Audio, Music Processing with Large Language Model

    Language: Python
  • richard-peng-xia/awesome-multimodal-in-medical-imaging

    A collection of resources on applications of multi-modal learning in medical imaging.

  • yaotingwangofficial/Awesome-MCoT

    Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

  • LLaVA-VL/LLaVA-Plus-Codebase

    LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

    Language: Python
  • NVIDIA/audio-flamingo

    PyTorch implementation of Audio Flamingo: a series of advanced audio-understanding language models

    Language: Python
  • AIDC-AI/Awesome-Unified-Multimodal-Models

    Awesome Unified Multimodal Models

  • deepglint/unicom

    Large-Scale Visual Representation Model

    Language: Python
  • rese1f/MovieChat

    [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

    Language: Python
  • MME-Benchmarks/Video-MME

    ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

  • VITA-MLLM/Woodpecker

    ✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models

    Language: Python
  • FoundationVision/Liquid

    Liquid: Language Models are Scalable and Unified Multi-modal Generators

    Language: Python
  • SkyworkAI/Vitron

    [NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing

    Language: Python
  • ictnlp/LLaVA-Mini

    LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.

    Language: Python
  • YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

    🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).

    Language: HTML
  • Coobiw/MPP-LLaVA

    Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-like MLLM on an RTX 3090/4090 with 24 GB.

    Language: Jupyter Notebook
  • hustvl/EVF-SAM

    Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"

    Language: Python
  • Paranioar/Awesome_Matching_Pretraining_Transfering

    The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Insight.

  • jingyi0000/R1-VL

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Language: Python
  • AIDC-AI/Ovis-U1

    A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.

    Language: Python
  • HenryHZY/Awesome-Multimodal-LLM

    Research Trends in LLM-guided Multimodal Learning.

  • burglarhobbit/Awesome-Medical-Large-Language-Models

    Curated papers on Large Language Models in the healthcare and medical domain

  • baaivision/EVE

    EVE Series: Encoder-Free Vision-Language Models from BAAI

    Language: Python