vision-language-model

There are 607 repositories under the vision-language-model topic.

  • haotian-liu/LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built towards GPT-4V-level capabilities and beyond.

    Language: Python
  • OpenGVLab/InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o's performance.

    Language: Python
  • CVHub520/X-AnyLabeling

    Effortless data labeling with AI support from Segment Anything and other awesome models.

    Language: Python
  • QwenLM/Qwen-VL

The official repo of Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model proposed by Alibaba Cloud (see the usage sketch after this list).

    Language: Python
  • jingyaogong/minimind-v

🚀 Train a 26M-parameter visual multimodal VLM ("large model") from scratch in just 1 hour! 🌏

    Language: Python
  • PKU-Alignment/align-anything

    Align Anything: Training All-modality Model with Feedback

    Language: Python
  • deepseek-ai/DeepSeek-VL

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Language: Python
  • volcengine/MineContext

MineContext is your proactive, context-aware AI partner (Context Engineering + ChatGPT Pulse).

    Language: Python
  • dvlab-research/MGM

    Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

    Language: Python
  • MiniMax-AI/MiniMax-01

The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on linear attention.

    Language: Python
  • jingyi0000/VLM_survey

    Collection of AWESOME vision-language models for vision tasks

  • InternLM/InternLM-XComposer

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

    Language: Python
  • BAAI-Agents/Cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized, general environment with minimal requirements.

    Language: Python
  • illuin-tech/colpali

The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol (see the retrieval sketch after this list).

    Language: Python
  • Blaizzy/mlx-vlm

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX (see the inference sketch after this list).

    Language: Python
  • AlibabaResearch/AdvancedLiterateMachinery

    A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

    Language: C++
  • showlab/ShowUI

    [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

    Language: Python
  • ByteDance-Seed/Seed1.5-VL

    Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

    Language: Jupyter Notebook
  • emcf/thepipe

    Get clean data from tricky documents, powered by vision-language models ⚡

    Language: Python
  • NVlabs/describe-anything

    [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

    Language: Python
  • AIDC-AI/Ovis

    A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

    Language: Python
  • 2U1/Qwen-VL-Series-Finetune

An open-source implementation for fine-tuning the Qwen-VL series from Alibaba Cloud.

    Language: Python
  • llm-jp/awesome-japanese-llm

Overview of Japanese LLMs

    Language: TypeScript
  • NVlabs/prismer

    The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

    Language: Python
  • SkalskiP/vlms-zero-to-hero

    This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

    Language: Jupyter Notebook
  • gokayfem/awesome-vlm-architectures

    Famous Vision Language Models and Their Architectures

    Language: Markdown
  • PKU-YuanGroup/Chat-UniVi

    [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

    Language: Python
  • mbzuai-oryx/groundingLMM

    [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

    Language: Python
  • AIDC-AI/Awesome-Unified-Multimodal-Models

    Awesome Unified Multimodal Models

  • OpenBMB/VisRAG

    Parsing-free RAG supported by VLMs

    Language: Python
  • SunzeY/AlphaCLIP

    [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Language: Jupyter Notebook
  • SkalskiP/top-cvpr-2025-papers

This repository is a curated collection of the most exciting and influential CVPR 2025 papers. 🔥 [Paper + Code + Demo]

    Language: Python
  • zubair-irshad/Awesome-Robotics-3D

    A curated list of 3D Vision papers relating to the Robotics domain in the era of large models (LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites.

  • StarlightSearch/EmbedAnything

    Highly Performant, Modular, Memory Safe and Production-ready Inference, Ingestion and Indexing built in Rust 🦀

    Language: Rust
  • huangwl18/VoxPoser

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Language: Python
  • zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs

    A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
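
For QwenLM/Qwen-VL above, a minimal inference sketch (Python, via Hugging Face transformers) may help show how the chat checkpoint is typically driven. It follows the shape of the repo's README: the "Qwen/Qwen-VL-Chat" checkpoint, tokenizer.from_list_format, and model.chat come from that README, but exact arguments can differ between releases, and the image path below is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"  # chat variant of Qwen-VL on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Build a multimodal query: one image plus a text question.
query = tokenizer.from_list_format([
    {"image": "path/to/your/image.jpg"},  # placeholder: local path or URL
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```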
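For illuin-tech/colpali above, a retrieval sketch using the colpali-engine package may clarify the multi-vector, late-interaction setup. The class names and the "vidore/colpali-v1.2" checkpoint follow the project's README but are assumptions that may change between versions; the toy inputs below stand in for rendered document pages and user queries.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # one of the released ColPali checkpoints
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Toy inputs; in practice these are rendered document pages and user queries.
images = [Image.new("RGB", (448, 448), color="white")]
queries = ["What does the chart on page 3 show?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# Late-interaction (MaxSim) scores between every query and every page.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```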
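For Blaizzy/mlx-vlm above, a minimal inference sketch for Apple-silicon Macs, following the shape of the project's README; the module paths (mlx_vlm.prompt_utils, mlx_vlm.utils), the function signatures, and the quantized model name are assumptions that may differ between mlx-vlm versions.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # assumed quantized checkpoint
model, processor = load(model_path)
config = load_config(model_path)

images = ["path/to/your/image.jpg"]  # placeholder: local path or URL
prompt = "Describe this image."

# Wrap the prompt in the model's chat template before generation.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```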