vision-language-model

There are 607 repositories under the vision-language-model topic.

  • haotian-liu/LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built towards GPT-4V-level capabilities and beyond.

    Language: Python
  • OpenGVLab/InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o's performance.

    Language: Python
  • CVHub520/X-AnyLabeling

    Effortless data labeling with AI support from Segment Anything and other awesome models.

    Language: Python
  • QwenLM/Qwen-VL

The official repo of Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model proposed by Alibaba Cloud (see the usage sketch after this list).

    Language: Python
  • jingyaogong/minimind-v

🚀 Train a 26M-parameter visual multimodal VLM ("large model") from scratch in just 1 hour! 🌏

    Language: Python
  • PKU-Alignment/align-anything

    Align Anything: Training All-modality Model with Feedback

    Language: Python
  • deepseek-ai/DeepSeek-VL

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Language: Python
  • volcengine/MineContext

MineContext is your proactive, context-aware AI partner (Context Engineering + ChatGPT Pulse).

    Language: Python
  • dvlab-research/MGM

    Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

    Language: Python
  • MiniMax-AI/MiniMax-01

The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on linear attention.

    Language: Python
  • jingyi0000/VLM_survey

    Collection of AWESOME vision-language models for vision tasks

  • InternLM/InternLM-XComposer

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

    Language: Python
  • BAAI-Agents/Cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized, general environment with minimal requirements.

    Language: Python
  • illuin-tech/colpali

The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol (see the retrieval sketch after this list).

    Language: Python
  • Blaizzy/mlx-vlm

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX (see the inference sketch after this list).

    Language: Python
  • AlibabaResearch/AdvancedLiterateMachinery

    A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

    Language: C++
  • showlab/ShowUI

    [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

    Language: Python
  • ByteDance-Seed/Seed1.5-VL

    Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

    Language: Jupyter Notebook
  • emcf/thepipe

    Get clean data from tricky documents, powered by vision-language models ⚡

    Language: Python
  • NVlabs/describe-anything

    [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

    Language: Python
  • AIDC-AI/Ovis

    A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

    Language: Python
  • 2U1/Qwen-VL-Series-Finetune

An open-source implementation for fine-tuning the Qwen-VL series from Alibaba Cloud.

    Language: Python
  • llm-jp/awesome-japanese-llm

Overview of Japanese LLMs

    Language: TypeScript
  • NVlabs/prismer

    The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

    Language: Python
  • SkalskiP/vlms-zero-to-hero

    This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

    Language: Jupyter Notebook
  • gokayfem/awesome-vlm-architectures

    Famous Vision Language Models and Their Architectures

    Language: Markdown
  • PKU-YuanGroup/Chat-UniVi

    [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

    Language: Python
  • mbzuai-oryx/groundingLMM

    [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

    Language: Python
  • AIDC-AI/Awesome-Unified-Multimodal-Models

    Awesome Unified Multimodal Models

  • OpenBMB/VisRAG

    Parsing-free RAG supported by VLMs

    Language: Python
  • SunzeY/AlphaCLIP

    [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Language: Jupyter Notebook
  • SkalskiP/top-cvpr-2025-papers

This repository is a curated collection of the most exciting and influential CVPR 2025 papers. 🔥 [Paper + Code + Demo]

    Language: Python
  • zubair-irshad/Awesome-Robotics-3D

    A curated list of 3D Vision papers relating to the Robotics domain in the era of large models (LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites.

  • StarlightSearch/EmbedAnything

    Highly Performant, Modular, Memory Safe and Production-ready Inference, Ingestion and Indexing built in Rust 🦀

    Language: Rust
  • huangwl18/VoxPoser

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Language: Python
  • zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs

    A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
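
For QwenLM/Qwen-VL above, a minimal inference sketch (Python, via Hugging Face transformers) may help show how the chat checkpoint is typically driven. It follows the shape of the repo's README: the "Qwen/Qwen-VL-Chat" checkpoint, tokenizer.from_list_format, and model.chat come from that README, but exact arguments can differ between releases, and the image path below is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"  # chat variant of Qwen-VL on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Build a multimodal query: one image plus a text question.
query = tokenizer.from_list_format([
    {"image": "path/to/your/image.jpg"},  # placeholder: local path or URL
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```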
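For illuin-tech/colpali above, a retrieval sketch using the colpali-engine package may clarify the multi-vector, late-interaction setup. The class names and the "vidore/colpali-v1.2" checkpoint follow the project's README but are assumptions that may change between versions; the toy inputs below stand in for rendered document pages and user queries.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # one of the released ColPali checkpoints
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Toy inputs; in practice these are rendered document pages and user queries.
images = [Image.new("RGB", (448, 448), color="white")]
queries = ["What does the chart on page 3 show?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# Late-interaction (MaxSim) scores between every query and every page.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```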
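For Blaizzy/mlx-vlm above, a minimal inference sketch for Apple-silicon Macs, following the shape of the project's README; the module paths (mlx_vlm.prompt_utils, mlx_vlm.utils), the function signatures, and the quantized model name are assumptions that may differ between mlx-vlm versions.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # assumed quantized checkpoint
model, processor = load(model_path)
config = load_config(model_path)

images = ["path/to/your/image.jpg"]  # placeholder: local path or URL
prompt = "Describe this image."

# Wrap the prompt in the model's chat template before generation.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```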