vision-language-model
There are 525 repositories under the vision-language-model topic.
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
OpenGVLab/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
QwenLM/Qwen-VL
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
jingyaogong/minimind-v
🚀 Train a 26M-parameter visual multimodal VLM from scratch in just 1 hour! 🌏
PKU-Alignment/align-anything
Align Anything: Training All-modality Model with Feedback
deepseek-ai/DeepSeek-VL
DeepSeek-VL: Towards Real-World Vision-Language Understanding
dvlab-research/MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
MiniMax-AI/MiniMax-01
The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model built on Linear Attention.
jingyi0000/VLM_survey
Collection of AWESOME vision-language models for vision tasks
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
BAAI-Agents/Cradle
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
illuin-tech/colpali
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
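For context, ColVision-style retrieval embeds document pages and text queries separately and scores them with late interaction. A minimal sketch, assuming the colpali_engine package's published ColPali/ColPaliProcessor interface (the checkpoint name and exact method names follow the project's examples and may differ between versions):

```python
# Minimal ColPali retrieval sketch (assumed colpali_engine API; checkpoint
# name "vidore/colpali-v1.2" is illustrative).
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("report_page_1.png"), Image.open("report_page_2.png")]  # document page images
queries = ["What was the 2023 revenue?"]

with torch.no_grad():
    page_embeddings = model(**processor.process_images(pages).to(model.device))
    query_embeddings = model(**processor.process_queries(queries).to(model.device))

# Late-interaction (MaxSim) scoring: one score per (query, page) pair.
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
print(scores)
```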
AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Blaizzy/mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
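A minimal inference sketch for that workflow, assuming the load/generate interface shown in the mlx-vlm README (the quantized model path is illustrative, and the generate argument order has changed between releases):

```python
# Minimal MLX-VLM inference sketch (assumed API per the project README;
# the model path is illustrative and argument order may vary by version).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # assumed quantized checkpoint
model, processor = load(model_path)
config = load_config(model_path)

images = ["cat.jpg"]
prompt = "Describe this image."

# Wrap the prompt in the model's chat template before generation.
formatted = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted, images, max_tokens=128, verbose=False)
print(output)
```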
showlab/ShowUI
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
ByteDance-Seed/Seed1.5-VL
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
NVlabs/describe-anything
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
NVlabs/prismer
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡
llm-jp/awesome-japanese-llm
Overview of Japanese LLMs
2U1/Qwen2-VL-Finetune
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series from Alibaba Cloud.
SkalskiP/vlms-zero-to-hero
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
gokayfem/awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
PKU-YuanGroup/Chat-UniVi
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
mbzuai-oryx/groundingLMM
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
SunzeY/AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
OpenBMB/VisRAG
Parsing-free RAG supported by VLMs
SkalskiP/top-cvpr-2025-papers
This repository is a curated collection of the most exciting and influential CVPR 2025 papers. 🔥 [Paper + Code + Demo]
zubair-irshad/Awesome-Robotics-3D
A curated list of 3D vision papers related to the robotics domain in the era of large models (LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites.
huangwl18/VoxPoser
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
StarlightSearch/EmbedAnything
Highly Performant, Modular and Production-ready Inference, Ingestion and Indexing built in Rust 🦀
AIDC-AI/Awesome-Unified-Multimodal-Models
Awesome Unified Multimodal Models
zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
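These methods typically start from a frozen CLIP-style model and replace hand-written text prompts with learned context vectors (e.g., CoOp) or insert lightweight adapters. A minimal zero-shot baseline with hand-crafted prompts, sketched with the Hugging Face transformers CLIP classes (model name illustrative), shows the fixed template that such methods learn to optimize:

```python
# Zero-shot CLIP classification with hand-crafted prompts; prompt/adapter
# learning methods replace or augment this fixed template with learned
# parameters while keeping the CLIP backbone frozen.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "bird"]
prompts = [f"a photo of a {c}" for c in classes]  # the hand-written template
image = Image.open("example.jpg")

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over class prompts
print(dict(zip(classes, probs[0].tolist())))
```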
neonwatty/meme-search
The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.
FoundationVision/Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization