multimodal

There are 1064 repositories under the multimodal topic.

  • Mintplex-Labs/anything-llm

    The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, MCP compatibility, and more.

    Language:JavaScript49.1k3402.8k5.1k
  • haotian-liu/LLaVA

    [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

    Language:Python23.5k1601.6k2.6k
  • jina-ai/serve

    ☁️ Build multimodal AI applications with cloud-native stack

    Language:Python21.7k2151.9k2.2k
  • microsoft/unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Language:Python21.7k3091.4k2.7k
  • deepseek-ai/Janus

    Janus-Series: Unified Multimodal Understanding and Generation Models

    Language:Python17.5k1501702.2k
  • mediar-ai/screenpipe

    AI app store powered by 24/7 desktop history. open source | 100% local | dev friendly | 24/7 screen, mic recording

    Language:TypeScript15.6k911k1.2k
  • NVIDIA/NeMo

    A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

    Language:Python13.6k2192.5k2.8k
  • modelscope/ms-swift

    Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, GLM4.5, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, Ovis2.5, Llava, GLM4v, Phi4, ...) (AAAI 2025).

    Language:Python9.9k443.3k870
  • rerun-io/rerun

    Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.

    Language:Rust9.2k704.2k531
  • bentoml/BentoML

    The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

    Language:Python8.1k801.1k874
  • enricoros/big-AGI

    AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.

    Language:TypeScript6.6k776441.5k
  • SkalskiP/courses

    This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)

    Language:Python6.2k1008561
  • swyxio/ai-notes

    Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.

    Language:HTML6k17911526
  • X-PLUG/MobileAgent

    Mobile-Agent: The Powerful GUI Agent Family

    Language:Python5.6k62110551
  • facebookresearch/mmf

    A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

    Language:Python5.6k110657941
  • TEN-framework/TEN-Agent

    TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.

    Language:Python5.6k56222627
  • om-ai-lab/VLM-R1

    Solve Visual Understanding with Reinforced VLMs

    Language:Python5.5k45181350
  • PySpur-Dev/pyspur

    A visual playground for agentic workflows: Iterate over your agents 10x faster

    Language:TypeScript5.5k4744396
  • kyegomez/swarms

    The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai

    Language:Python5.2k56394634
  • PKU-Alignment/align-anything

    Align Anything: Training All-modality Model with Feedback

    Language:Jupyter Notebook4.5k26048503
  • kyegomez/tree-of-thoughts

    Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

    Language:Python4.5k5370374
  • luban-agi/Awesome-AIGC-Tutorials

    Curated tutorials and resources for Large Language Models, AI Painting, and more.

  • rom1504/img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

    Language:Python4.2k32279362
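  • IDEA-CCNL/Fengshenbang-LM

    Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.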

    Language:Python4.1k58300385
  • jina-ai/discoart

    🪩 Create Disco Diffusion artworks in one line

    Language:Python3.8k34107245
  • open-mmlab/mmpretrain

    OpenMMLab Pre-training Toolbox and Benchmark

    Language:Python3.7k297941.1k
  • NExT-GPT/NExT-GPT

    Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model

    Language:Python3.6k61112360
  • atfortes/Awesome-LLM-Reasoning

    From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓

  • OpenGVLab/InternGPT

    InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM).

    Language:Python3.2k4151231
  • microsoft/torchscale

    Foundation Architecture for (M)LLMs

    Language:Python3.1k4486219
  • docarray/docarray

    Represent, send, store and search multimodal data

    Language:Python3.1k45641234
  • InternLM/InternLM-XComposer

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

    Language:Python2.9k43425177
  • iterative/datachain

    ETL, Analytics, Versioning for Unstructured Data

    Language:Python2.7k17347124
  • rom1504/clip-retrieval

    Easily compute clip embeddings and build a clip retrieval system with them

    Language:Jupyter Notebook2.6k25234233
  • roboflow/maestro

    Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

    Language:Python2.6k3442216
  • OFA-Sys/OFA

    Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

    Language:Python2.5k20365248