multi-modality

There are 86 repositories under the multi-modality topic.

  • haotian-liu/LLaVA

    [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

    Language: Python · 23.6k stars
  • BradyFU/Awesome-Multimodal-Large-Language-Models

    ✨✨ Latest Advances on Multimodal Large Language Models

  • jina-ai/clip-as-service

    🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP

    Language: Python · 12.7k stars
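
    As a quick illustration of the embedding workflow the tagline describes, here is a minimal client-side sketch, assuming a clip_server instance is already running at the address below (the Client/encode names follow the project's README; the address and inputs are illustrative):

      from clip_client import Client

      # Connect to a running clip_server instance (the address is an
      # assumption; point it at wherever the server was started).
      client = Client('grpc://0.0.0.0:51000')

      # Text and image inputs land in the same CLIP embedding space,
      # so the resulting vectors can be compared directly.
      embeddings = client.encode([
          'a photo of a surfer riding a wave',   # sentence
          'https://example.com/surfer.jpg',      # image by URL (illustrative)
      ])
      print(embeddings.shape)  # (2, dim), e.g. (2, 512) for a ViT-B/32 model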
  • swarms

    kyegomez/swarms

    The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai

    Language: Python · 5.2k stars
  • lucidrains/deep-daze

    Simple command line tool for text-to-image generation using OpenAI's CLIP and SIREN (an implicit neural representation network). The technique was originally created by https://twitter.com/advadnoun

    Language: Python · 4.3k stars
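
    A minimal sketch of driving the tool from Python rather than the command line, using the Imagine class shown in the project's README (the prompt and num_layers value are illustrative):

      from deep_daze import Imagine

      # Optimizes a SIREN network until its rendered image matches the
      # text prompt under CLIP; intermediate images are saved to disk.
      imagine = Imagine(
          text='a house in the forest',  # illustrative prompt
          num_layers=24,                 # deeper SIREN -> finer detail, more VRAM
      )
      imagine()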
  • EvolvingLMMs-Lab/Otter

    🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

    Language: Python · 3.3k stars
  • InternLM/InternLM-XComposer

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

    Language: Python · 2.9k stars
  • DLR-RM/3DObjectTracking

    Algorithms and Publications on 3D Object Tracking

    Language: C++ · 919 stars
  • OpenBMB/VisRAG

    Parsing-free RAG supported by VLMs

    Language: Python · 786 stars
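
    "Parsing-free" here means document pages are retrieved and read as images, with no OCR or layout-parsing step. A hypothetical sketch of that pipeline shape (every name below is a placeholder for illustration, not VisRAG's API):

      import numpy as np

      def cosine(a, b):
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

      # Hypothetical flow: embed_text, embed_image, and vlm_generate stand in
      # for a VLM-based retriever and generator; they are not VisRAG functions.
      def visual_rag_answer(question, page_images, embed_text, embed_image,
                            vlm_generate, k=3):
          q = embed_text(question)
          # 1) Retrieval: score every raw page image against the question.
          ranked = sorted(page_images, key=lambda img: -cosine(q, embed_image(img)))
          # 2) Generation: hand the top-k page images straight to a VLM.
          return vlm_generate(images=ranked[:k], prompt=question)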
  • OpenGVLab/Multi-Modality-Arena

    Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

    Language: Python · 538 stars
  • LSXI7/MINIMA

    [CVPR 2025] MINIMA: Modality Invariant Image Matching

    Language: Python · 487 stars
  • kyegomez/Gemini

    The open-source implementation of Gemini, the Google model slated to "eclipse ChatGPT"

    Language: Python · 459 stars
  • researchmm/MM-Diffusion

    [CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

    Language: Python · 442 stars
  • ziqihuangg/Collaborative-Diffusion

    [CVPR 2023] Collaborative Diffusion

    Language: Python · 430 stars
  • xiaoachen98/Open-LLaVA-NeXT

    An open-source implementation for training LLaVA-NeXT.

    Language: Python · 419 stars
  • kyegomez/Sophia

    Effortless plug-and-play optimizer that cuts model training costs by 50%: a new optimizer that is 2x faster than Adam on LLMs.

    Language: Python · 382 stars
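
    For context on the speed claim: Sophia-style optimizers precondition the gradient with a cheap diagonal Hessian estimate and clip the resulting step. A simplified numpy sketch of the update rule from the Sophia paper (the paper refreshes the Hessian estimate only every k steps, and the hyperparameter values below are illustrative, not this repo's defaults):

      import numpy as np

      def sophia_step(theta, m, h, grad, hess_diag_est, lr=1e-4,
                      beta1=0.965, beta2=0.99, gamma=0.01, eps=1e-12):
          m = beta1 * m + (1 - beta1) * grad            # EMA of gradients
          h = beta2 * h + (1 - beta2) * hess_diag_est   # EMA of diag-Hessian estimate
          # Preconditioned step, clipped per coordinate to [-1, 1] so that
          # tiny curvature estimates cannot blow up the update.
          update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
          return theta - lr * update, m, h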
  • dvlab-research/VisionZip

    Official repository for VisionZip (CVPR 2025)

    Language: Python · 347 stars
  • RLHF-V/RLHF-V

    [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

    Language: Python · 292 stars
  • DerrickWang005/CRIS.pytorch

    An official PyTorch implementation of the CRIS paper

    Language: Python · 278 stars
  • ZwwWayne/mmMOT

    [ICCV2019] Robust Multi-Modality Multi-Object Tracking

    Language: Python · 256 stars
  • dvlab-research/UVTR

    Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)

    Language: Python · 242 stars
  • jackyjsy/CVPR21Chal-SLR

    This repo contains the official code of our work SAM-SLR, which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.

    Language: Python · 217 stars
  • yangcaoai/CoDA_NeurIPS2023

    Official code for NeurIPS2023 paper: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection

    Language: Jupyter Notebook · 210 stars
  • sshh12/multi_token

    Embed arbitrary modalities (images, audio, documents, etc.) into large language models.

    Language: Python · 187 stars
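
    The description matches the LLaVA-style recipe: project a frozen encoder's features into the LLM's token-embedding space so each modality shows up as a few extra "tokens". A hypothetical PyTorch sketch of that pattern (none of these names come from the multi_token codebase):

      import torch.nn as nn

      class ModalityProjector(nn.Module):
          """Hypothetical illustration: map encoder features to a fixed
          number of pseudo-token embeddings the LLM can consume inline."""
          def __init__(self, feat_dim, llm_dim, num_tokens=8):
              super().__init__()
              self.num_tokens, self.llm_dim = num_tokens, llm_dim
              self.proj = nn.Linear(feat_dim, num_tokens * llm_dim)

          def forward(self, feats):          # feats: (batch, feat_dim)
              out = self.proj(feats)         # (batch, num_tokens * llm_dim)
              return out.view(-1, self.num_tokens, self.llm_dim)

      # The pseudo-token embeddings are then spliced into the text-token
      # embedding sequence at a placeholder position before the LLM forward pass.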
  • ChenHongruixuan/BRIGHT

    [IEEE GRSS DFC 2025 Track II] BRIGHT: A globally distributed multimodal VHR dataset for all-weather disaster response

    Language: Python · 169 stars
  • jina-ai/rungpt

    An open-source, cloud-native serving framework for large multi-modal models (LMMs).

    Language: Python · 167 stars
  • Lee-Gihun/MEDIAR

    (NeurIPS 2022 CellSeg Challenge - 1st Winner) Open source code for "MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy"

    Language: Python · 153 stars
  • dvlab-research/Prompt-Highlighter

    [CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs

    Language: Python · 152 stars
  • kyegomez/Andromeda

    An all-new language model that processes ultra-long sequences of 100,000+ tokens, ultra-fast.

    Language: Python · 152 stars
  • kyegomez/the-compiler

    Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!

    Language: Python · 144 stars
  • skit-ai/SpeechLLM

    This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.

    Language: Python · 121 stars
  • kyegomez/MambaByte

    Implementation of MambaByte from the paper "MambaByte: Token-free Selective State Space Model", in PyTorch and Zeta

    Language: Python · 115 stars
  • kyegomez/MoE-Mamba

    Implementation of MoE-Mamba from the paper "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts", in PyTorch and Zeta

    Language: Python · 110 stars
  • SsGood/MMGL

    Multi-modal Graph Learning for Disease Prediction (IEEE Trans. on Medical Imaging, TMI 2022)

    Language: Jupyter Notebook · 105 stars
  • rsy6318/CorrI2P

    [TCSVT] CorrI2P: Deep Image-to-Point Cloud Registration via Dense Correspondence. The code of CorrI2P.

    Language: Python · 91 stars
  • kyegomez/Kosmos2.5

    My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"

    Language: Python · 73 stars