vision-language-model

There are 319 repositories under the vision-language-model topic.

  • haotian-liu/LLaVA

    [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

    Language: Python · 22k stars
  • OpenGVLab/InternVL

    [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.

    Language: Python · 7.4k stars
  • QwenLM/Qwen-VL

    The official repo of Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model proposed by Alibaba Cloud (see the usage sketch after this list).

    Language: Python · 5.7k stars
  • deepseek-ai/DeepSeek-VL

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Language: Python · 3.7k stars
  • dvlab-research/MGM

    Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

    Language: Python · 3.3k stars
  • PKU-Alignment/align-anything

    Align Anything: Training All-modality Model with Feedback

    Language: Python · 3.1k stars
  • InternLM/InternLM-XComposer

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

    Language: Python · 2.8k stars
  • jingyi0000/VLM_survey

    Collection of AWESOME vision-language models for vision tasks

  • MiniMax-AI/MiniMax-01

    The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model built on linear attention.

    Language: Python · 2.4k stars
  • BAAI-Agents/Cradle

    The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task by combining strong reasoning abilities, self-improvement, and skill curation in a standardized, general environment with minimal requirements.

    Language: Python · 2k stars
  • jingyaogong/minimind-v

    🚀 Train a 26M-parameter vision-language model (VLM) from scratch in just 1 hour! 🌏

    Language: Python · 2k stars
  • AlibabaResearch/AdvancedLiterateMachinery

    A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

    Language: C++ · 1.7k stars
  • illuin-tech/colpali

    The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol (see the retrieval sketch after this list).

    Language: Python · 1.7k stars
  • NVlabs/prismer

    The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

    Language: Python · 1.3k stars
  • showlab/ShowUI

    [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

    Language: Python · 1.1k stars
  • llm-jp/awesome-japanese-llm

    Overview of Japanese LLMs (日本語LLMまとめ)

    Language: TypeScript · 1.1k stars
  • Blaizzy/mlx-vlm

    MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX (see the inference sketch after this list).

    Language: Python · 1.1k stars
  • SkalskiP/vlms-zero-to-hero

    This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

    Language: Jupyter Notebook · 1.1k stars
  • PKU-YuanGroup/Chat-UniVi

    [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

    Language: Python · 928 stars
  • mbzuai-oryx/groundingLMM

    [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

    Language: Python · 856 stars
  • AIDC-AI/Ovis

    A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

    Language: Python · 850 stars
  • SunzeY/AlphaCLIP

    [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Language: Jupyter Notebook · 794 stars
  • gokayfem/awesome-vlm-architectures

    Famous Vision Language Models and Their Architectures

    Language: Markdown · 751 stars
  • zubair-irshad/Awesome-Robotics-3D

    A curated list of 3D vision papers related to the robotics domain in the era of large models (LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites

  • huangwl18/VoxPoser

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Language: Python · 652 stars
  • OpenBMB/VisRAG

    Parsing-free RAG supported by VLMs

    Language: Python · 648 stars
  • FoundationVision/Groma

    [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization

    Language: Python · 555 stars
  • 2U1/Qwen2-VL-Finetune

    An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.

    Language: Python · 538 stars
  • neonwatty/meme-search

    The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.

    Language: Ruby · 530 stars
  • OpenGVLab/Multi-Modality-Arena

    Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

    Language: Python · 508 stars
  • StarlightSearch/EmbedAnything

    Production-ready Inference, Ingestion and Indexing built in Rust 🦀

    Language: Rust · 496 stars
  • zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs

    A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.

  • Flame-Code-VLM/Flame-Code-VLM

    Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.

    Language: Python · 477 stars
  • AlaaLab/InstructCV

    [ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"

    Language: Python · 462 stars
  • PJLab-ADG/awesome-knowledge-driven-AD

    A curated list of awesome knowledge-driven autonomous driving (continually updated)

  • ictnlp/LLaVA-Mini

    LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.

    Language: Python · 428 stars
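
For the QwenLM/Qwen-VL entry above, here is a minimal usage sketch, assuming the Hugging Face checkpoint Qwen/Qwen-VL-Chat and the chat interface described in that repo's README; method names and arguments may differ between releases.

```python
# Minimal Qwen-VL-Chat sketch (assumes the "Qwen/Qwen-VL-Chat" checkpoint and
# the chat API from the repo README; details may differ between releases).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a mixed image+text query; "demo.jpg" is a placeholder local path.
query = tokenizer.from_list_format([
    {"image": "demo.jpg"},
    {"text": "Describe this image."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```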
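
For the illuin-tech/colpali entry, a hedged retrieval sketch, assuming the colpali_engine package and the vidore/colpali-v1.2 checkpoint; the processor and scoring helpers follow the README from memory and may have changed across versions.

```python
# ColPali late-interaction retrieval sketch (assumes colpali_engine and the
# "vidore/colpali-v1.2" checkpoint; helper names may vary by version).
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Placeholder inputs: one blank page image and one text query.
images = [Image.new("RGB", (448, 448), color="white")]
queries = ["What revenue figure is reported for 2023?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# MaxSim scores: one row per query, one column per page.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```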
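
And for the Blaizzy/mlx-vlm entry, a sketch of single-image inference on Apple Silicon. It assumes the package exposes load/generate helpers and that an MLX-converted checkpoint such as mlx-community/Qwen2-VL-2B-Instruct-4bit is available; both the checkpoint name and the exact argument order are assumptions and may differ between versions.

```python
# mlx-vlm inference sketch on Apple Silicon (load/generate helpers and the
# mlx-community checkpoint are assumptions; signatures may differ by version).
from mlx_vlm import load, generate

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="photo.jpg",  # placeholder local image path
    max_tokens=128,
)
print(output)
```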