multimodal
There are 1,064 repositories under the multimodal topic.
Mintplex-Labs/anything-llm
The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, MCP compatibility, and more.
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
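For context, a minimal sketch of running a LLaVA-style model for visual instruction following. It uses the community llava-hf checkpoint served through Hugging Face transformers rather than this repository's own inference scripts, and the image path is a placeholder.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    # Community-converted LLaVA-1.5 checkpoint (assumption: a transformers version with LLaVA support).
    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("example.jpg")  # hypothetical local image
    prompt = "USER: <image>\nDescribe this image. ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(output[0], skip_special_tokens=True))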
jina-ai/serve
☁️ Build multimodal AI applications with a cloud-native stack
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
deepseek-ai/Janus
Janus-Series: Unified Multimodal Understanding and Generation Models
mediar-ai/screenpipe
AI app store powered by 24/7 desktop history. Open source | 100% local | dev friendly | 24/7 screen and mic recording
NVIDIA/NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
modelscope/ms-swift
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, GLM4.5, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, Ovis2.5, Llava, GLM4v, Phi4, ...) (AAAI 2025).
rerun-io/rerun
Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
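A minimal sketch of logging one image frame and a text event with the Rerun Python SDK (rerun-sdk); the application id and entity paths are made up for illustration.

    import numpy as np
    import rerun as rr  # pip install rerun-sdk

    rr.init("multimodal_demo", spawn=True)             # hypothetical app id; spawns the viewer
    frame = np.zeros((480, 640, 3), dtype=np.uint8)    # placeholder camera frame
    rr.log("camera/image", rr.Image(frame))            # stream the image under an entity path
    rr.log("events", rr.TextLog("processed frame 0"))  # attach a text event alongside it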
bentoml/BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
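A minimal sketch of exposing a function as an inference API with BentoML's service/api decorators (the 1.2+ Python SDK); the class and method names are hypothetical.

    import bentoml

    @bentoml.service
    class Echo:
        # Exposes a single REST endpoint; run locally with: bentoml serve <module>:Echo
        @bentoml.api
        def echo(self, text: str) -> str:
            return text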
enricoros/big-AGI
AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.
SkalskiP/courses
This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)
swyxio/ai-notes
Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.
X-PLUG/MobileAgent
Mobile-Agent: The Powerful GUI Agent Family
facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
TEN-framework/TEN-Agent
TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
PySpur-Dev/pyspur
A visual playground for agentic workflows: Iterate over your agents 10x faster
kyegomez/swarms
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
PKU-Alignment/align-anything
Align Anything: Training All-Modality Models with Feedback
kyegomez/tree-of-thoughts
Plug-and-play implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that elevates model reasoning by at least 70%
luban-agi/Awesome-AIGC-Tutorials
Curated tutorials and resources for Large Language Models, AI Painting, and more.
rom1504/img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20 hours on one machine.
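A minimal sketch of the library's Python entry point (the same options back the CLI); the file names and exact option values here are illustrative.

    from img2dataset import download  # pip install img2dataset

    download(
        url_list="urls.txt",         # hypothetical text file with one image URL per line
        output_folder="images",      # hypothetical output directory
        image_size=256,              # resize target in pixels
        thread_count=64,             # parallel download threads
        output_format="webdataset",  # pack images + metadata into .tar shards
    )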
IDEA-CCNL/Fengshenbang-LM
Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.
jina-ai/discoart
🪩 Create Disco Diffusion artworks in one line
open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
NExT-GPT/NExT-GPT
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
atfortes/Awesome-LLM-Reasoning
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
OpenGVLab/InternGPT
InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, and more. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM).
microsoft/torchscale
Foundation Architecture for (M)LLMs
docarray/docarray
Represent, send, store and search multimodal data
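A minimal sketch of a multimodal schema with DocArray's BaseDoc/DocList API (v2-style); the field names, example URL, and embedding size are made up.

    from typing import Optional

    from docarray import BaseDoc, DocList
    from docarray.typing import ImageUrl, NdArray

    class CaptionedImage(BaseDoc):                 # hypothetical schema
        image: ImageUrl                            # validated image URL
        caption: str
        embedding: Optional[NdArray[512]] = None   # to be filled by an encoder later

    docs = DocList[CaptionedImage](
        [CaptionedImage(image="https://example.com/cat.png", caption="a cat")]
    )
    print(docs[0].caption)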
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
iterative/datachain
ETL, Analytics, Versioning for Unstructured Data
rom1504/clip-retrieval
Easily compute CLIP embeddings and build a CLIP retrieval system with them
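A minimal sketch of querying a hosted index with the ClipClient helper; the service URL and index name follow the project's documented example and may not be live, and the result keys are assumptions.

    from clip_retrieval.clip_client import ClipClient  # pip install clip-retrieval

    client = ClipClient(
        url="https://knn.laion.ai/knn-service",  # example backend from the docs; may be offline
        indice_name="laion5B-L-14",              # example index name
        num_images=5,
    )
    results = client.query(text="an orange tabby cat sleeping")
    for r in results:
        print(r.get("url"), r.get("similarity"))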
roboflow/maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL.
OFA-Sys/OFA
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework