multi-modality
There are 86 repositories under the multi-modality topic.
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
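LLaVA's recipe is worth a sketch: a frozen vision encoder produces patch features, a small projector maps them into the LLM's embedding space, and the projected visual tokens are prepended to the text embeddings. A minimal sketch of that connector, with the two-layer MLP and all dimensions as illustrative LLaVA-1.5-style assumptions, not the repo's exact code:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Map vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP connector, in the spirit of LLaVA-1.5.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from a frozen ViT.
        return self.proj(patch_feats)

# Illustrative forward pass with random stand-ins for the real encoders.
projector = VisionProjector()
image_feats = torch.randn(1, 576, 1024)   # e.g. 24x24 CLIP ViT patches
text_embeds = torch.randn(1, 32, 4096)    # embedded prompt tokens
visual_tokens = projector(image_feats)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM
print(llm_input.shape)  # torch.Size([1, 608, 4096])
```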
BradyFU/Awesome-Multimodal-Large-Language-Models
✨✨ Latest Advances on Multimodal Large Language Models
jina-ai/clip-as-service
🏄 Scalable embedding, reasoning, and ranking for images and sentences with CLIP
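Usage follows a thin client/server split; a minimal sketch based on the project's documented clip_client API, assuming a clip-server instance is already listening on the port shown:

```python
from clip_client import Client

# Assumes a clip-server is running locally on this gRPC port.
c = Client('grpc://0.0.0.0:51000')

# Texts and image URIs can be mixed in one call; each item comes back as a
# fixed-size CLIP embedding suitable for ranking or retrieval.
embeddings = c.encode([
    'a photo of a surfer riding a wave',
    'path/to/photo.jpg',  # placeholder: a local file path or image URI
])
print(embeddings.shape)  # (2, embedding_dim)
```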
kyegomez/swarms
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
lucidrains/deep-daze
A simple command-line tool for text-to-image generation using OpenAI's CLIP and SIREN (an implicit neural representation network). The technique was originally created by https://twitter.com/advadnoun
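The SIREN half is just an MLP with sine activations that maps pixel coordinates to colors; CLIP then scores the rendered image against the text prompt. A minimal sketch of a SIREN coordinate network with the customary w0 = 30 frequency scaling (initialization details omitted; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    """Linear layer followed by a sine activation, as in SIREN."""
    def __init__(self, in_dim: int, out_dim: int, w0: float = 30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.w0 = w0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.w0 * self.linear(x))

# A coordinate network: (x, y) in [-1, 1]^2 -> RGB.
net = nn.Sequential(SirenLayer(2, 256), SirenLayer(256, 256), nn.Linear(256, 3))
coords = torch.rand(1024, 2) * 2 - 1  # random pixel coordinates
rgb = net(coords)                     # image values that CLIP would score
print(rgb.shape)  # torch.Size([1024, 3])
```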
EvolvingLMMs-Lab/Otter
🦦 Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning abilities.
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
DLR-RM/3DObjectTracking
Algorithms and Publications on 3D Object Tracking
OpenBMB/VisRAG
Parsing-free retrieval-augmented generation (RAG) powered by vision-language models (VLMs)
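Parsing-free means each document page is embedded directly as an image by a VLM and retrieved by embedding similarity, with no OCR or layout-parsing step. A generic cosine-similarity retrieval sketch; the embeddings below are random stand-ins for what VisRAG's retriever would produce:

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, page_embs: torch.Tensor, k: int = 3):
    """Return indices of the k page images most similar to the query.

    query_emb: (dim,) embedding of the user question.
    page_embs: (num_pages, dim) VLM embeddings of raw page images.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), page_embs, dim=-1)
    return sims.topk(k).indices

# Random stand-ins for real VLM embeddings.
query = torch.randn(768)
pages = torch.randn(100, 768)
print(retrieve(query, pages))  # top-3 page indices to feed the generator VLM
```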
OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
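Arena-style leaderboards typically aggregate pairwise human votes into an Elo-style rating; a minimal sketch of that update rule, with the conventional chess K-factor and scale as illustrative assumptions rather than this repo's exact settings:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update two model ratings after one head-to-head comparison.

    score_a is 1.0 if model A's answer won, 0.0 if it lost, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# LLaVA beats MiniGPT-4 on one image-grounded question:
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```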
LSXI7/MINIMA
[CVPR 2025] MINIMA: Modality Invariant Image Matching
kyegomez/Gemini
An open-source implementation of Gemini, the Google model touted to "eclipse ChatGPT"
researchmm/MM-Diffusion
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
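Whatever couples the two modalities, joint audio-video diffusion still rests on the standard forward noising step applied to both streams in parallel; a generic sketch of that step (shapes are illustrative, not the paper's):

```python
import torch

def q_sample(x0: torch.Tensor, alpha_bar_t: torch.Tensor,
             noise: torch.Tensor = None) -> torch.Tensor:
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    noise = torch.randn_like(x0) if noise is None else noise
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise

video = torch.randn(1, 16, 3, 64, 64)  # (batch, frames, C, H, W)
audio = torch.randn(1, 1, 25600)       # paired waveform segment
abar = torch.tensor(0.5)               # cumulative noise level at step t
video_t, audio_t = q_sample(video, abar), q_sample(audio, abar)
```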
ziqihuangg/Collaborative-Diffusion
[CVPR 2023] Collaborative Diffusion for Multi-Modal Face Generation and Editing
xiaoachen98/Open-LLaVA-NeXT
An open-source implementation for training LLaVA-NeXT.
kyegomez/Sophia
An effortless, plug-and-play optimizer designed to cut model training costs by 50%; reported to be 2x faster than Adam on LLMs.
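The core of Sophia is dividing gradient momentum by an EMA estimate of the diagonal Hessian and clipping the ratio elementwise, which is where the claimed speedup over Adam comes from. A minimal single-tensor sketch of that clipped update, assuming a squared-gradient stand-in for the Hessian estimate (the paper uses Gauss-Newton-Bartlett or Hutchinson estimators refreshed every few steps); not the repo's actual optimizer class:

```python
import torch

def sophia_step(param, grad, m, h, lr=1e-4, beta1=0.96, beta2=0.99,
                rho=0.04, eps=1e-12):
    """One Sophia-style update on a single tensor (in-place on param).

    m: EMA of gradients; h: EMA of a diagonal Hessian estimate.
    The per-coordinate ratio m / max(rho * h, eps) is clipped to [-1, 1].
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    hess_est = grad * grad  # assumption: squared-gradient stand-in estimator
    h.mul_(beta2).add_(hess_est, alpha=1 - beta2)
    update = torch.clamp(m / torch.clamp(rho * h, min=eps), -1.0, 1.0)
    param.sub_(lr * update)

p = torch.randn(10)
m, h = torch.zeros(10), torch.zeros(10)
sophia_step(p, torch.randn(10), m, h)
```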
dvlab-research/VisionZip
Official repository for VisionZip (CVPR 2025)
RLHF-V/RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
DerrickWang005/CRIS.pytorch
An official PyTorch implementation of CRIS: CLIP-Driven Referring Image Segmentation
ZwwWayne/mmMOT
[ICCV 2019] Robust Multi-Modality Multi-Object Tracking
dvlab-research/UVTR
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
jackyjsy/CVPR21Chal-SLR
This repo contains the official code of our work SAM-SLR, which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.
yangcaoai/CoDA_NeurIPS2023
Official code for the NeurIPS 2023 paper "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection"
sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
ChenHongruixuan/BRIGHT
[IEEE GRSS DFC 2025 Track II] BRIGHT: A globally distributed multimodal VHR dataset for all-weather disaster response
jina-ai/rungpt
An open-source, cloud-native serving framework for large multi-modal models (LMMs).
Lee-Gihun/MEDIAR
(NeurIPS 2022 CellSeg Challenge - 1st Winner) Open source code for "MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy"
dvlab-research/Prompt-Highlighter
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
kyegomez/Andromeda
An all-new language model that processes ultra-long sequences of 100,000+ tokens, ultra-fast
kyegomez/the-compiler
Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!
skit-ai/SpeechLLM
This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.
kyegomez/MambaByte
Implementation of MambaByte from the paper "MambaByte: Token-free Selective State Space Model", in PyTorch and Zeta
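Token-free means the model consumes raw UTF-8 bytes rather than subword tokens, so the entire "tokenizer" is the byte codec:

```python
text = "multi-modal 🦦"
byte_ids = list(text.encode("utf-8"))   # vocabulary is just 0..255
print(byte_ids)
print(bytes(byte_ids).decode("utf-8"))  # lossless round-trip
```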
kyegomez/MoE-Mamba
Implementation of MoE-Mamba from the paper "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts", in PyTorch and Zeta
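MoE-Mamba interleaves Mamba blocks with sparse mixture-of-experts feed-forward layers; the essential machinery is a router that sends each token to its top-k experts. A generic top-k routing sketch (expert count, dimensions, and k are illustrative defaults, not the repo's):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse MoE feed-forward layer with softmax top-k routing."""
    def __init__(self, dim: int = 512, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                           # (tokens, num_experts)
        weights, idx = logits.softmax(-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):        # dense loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```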
SsGood/MMGL
Multi-modal Graph Learning for Disease Prediction (IEEE Transactions on Medical Imaging, TMI 2022)
rsy6318/CorrI2P
[TCSVT] CorrI2P: Deep Image-to-Point Cloud Registration via Dense Correspondence. The official code of CorrI2P.
kyegomez/Kosmos2.5
My implementation of Kosmos-2.5 from the paper "KOSMOS-2.5: A Multimodal Literate Model"