visual-language-models

There are 45 repositories under the visual-language-models topic.

  • zai-org/CogVLM

    A state-of-the-art open visual language model | Multimodal pretrained model

    Language: Python
  • camel-ai/crab

    🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/

    Language: Python
  • MiniMax-AI/One-RL-to-See-Them-All

    The official repo of "One RL to See Them All: Visual Triple Unified Reinforcement Learning"

    Language: Python
  • bilel-bj/ROSGPT_Vision

    Commanding robots using only language model prompts

    Language: Python
  • xinyanghuang7/Basic-Visual-Language-Model

    Build a simple, basic multimodal large model from scratch. 🤖

    Language: Python
  • BioMedIA-MBZUAI/FetalCLIP

    Official repository of FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

    Language: Python
  • kesimeg/awesome-turkish-language-models

    A curated list of Turkish AI models, datasets, and papers

  • jaisidhsingh/CoN-CLIP

    Implementation of the "Learn No to Say Yes Better" paper.

    Language: Python
  • yangjie-cv/WeThink

    WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

    Language: Python
  • AlignGPT-VL/AlignGPT

    Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"

    Language: Python
  • tianyu-z/VCR

    Official repo for the paper "VCR: Visual Caption Restoration". See arxiv.org/pdf/2406.06462 for details.

    Language: Python
  • Sid2697/HOI-Ref

    Code implementation for the paper "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision"

    Language: Python
  • amathislab/wildclip

    Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models

    Language: Python
  • sduzpf/UAP_VLP

    Universal Adversarial Perturbations for Vision-Language Pre-trained Models

    Language: Python
  • csebuetnlp/IllusionVQA

    This repository contains the data and code of the paper titled "IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models"

    Language: Jupyter Notebook
  • CristianoPatricio/concept-based-interpretability-VLM

    Code for the paper "Towards Concept-based Interpretability of Skin Lesion Diagnosis using Vision-Language Models", ISBI 2024 (Oral).

    Language: Jupyter Notebook
  • Linvyl/DAM-QA

    [ICCVW 2025] Implementation for DAM-QA: Describe Anything Model for Visual Question Answering on Text-rich Images

    Language: Python
  • declare-lab/Sealing

    [NAACL 2024] Official implementation of the paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image-Text Models"

    Language: Python
  • GraphPKU/CoI

    Chain of Images for Intuitively Reasoning

    Language: Python
  • NxtGenLegend/TreeHacks-ZoneOut

    #3 Winner of Best Use of Zoom API at Stanford TreeHacks 2025! An AI-powered meeting assistant that captures video, audio, and textual context from Zoom calls using multimodal RAG.

    Language: JavaScript
  • shreydan/VLM-OD

    Experimental: fine-tune SmolVLM on COCO (without any special <locXYZ> tokens)

    Language: Jupyter Notebook
  • AikyamLab/hallucinogen

    A benchmark for evaluating hallucinations in large visual language models

    Language: Python
  • ArthurBabkin/Parimate

    A Telegram bot for validating audio and video content using CV models, SR models, and VLMs, with deepfake detection leveraging metadata analysis.

    Language: Python
  • kornia/kornia-paligemma

    Rust implementation of Google Paligemma with Candle

    Language: Rust
  • vlvink/PaliGemma-from-scratch

    PaliGemma implemented from scratch, following a YouTube guide, as a learning and demonstration project. It applies the modern development approaches and best practices shown in the original tutorial.

    Language: Python
  • cplou99/FALCONEye

    Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

  • nkkbr/ViCA

    This is the official implementation of ViCA2 (Visuospatial Cognitive Assistant 2), a multimodal large language model designed for advanced visuospatial reasoning. The repository also provides training scripts for the original ViCA model.

    Language: Python
  • K1nght/T2I-ConBench

    T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

    Language: Python
  • ARResearch-1/DiverseAR-Dataset

    Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble

  • laclouis5/uform-coreml-converters

    CLI for converting UForm models to CoreML.

    Language: Python
  • tristandb8/PyTorch-PaliGemma-2

    PyTorch implementation of PaliGemma 2

    Language: Python
  • fullscreen-triangle/pakati

    A specialized tool that provides granular control over AI image generation by enabling region-based prompting, editing, and transformation with metacognitive orchestration.

    Language: Python
  • kornia/kornia-infernum

    👺 Rust inference engine for Visual Language Models

    Language: Rust
  • alessioborgi/RealTime-VLM

    RealTime-VLM brings real-time VLM inference to the browser. It continuously captures webcam frames, sends image+text to an OpenAI-compatible API, and displays responses with sub-second latency. Works with local or hosted VLMs (see the request sketch after this list).

    Language: JavaScript
  • Mr-Wonderfool/Multimodal-Reinforce-CoT

    Fine-tuning Qwen2.5-VL-3B-Instruct with reinforcement learning to produce high-quality chains of thought on the GQA dataset

    Language: Python
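
The RealTime-VLM entry above describes a common pattern among these projects: stream webcam frames, attach a text prompt, and send both to an OpenAI-compatible endpoint backed by a local or hosted VLM. Below is a minimal Python sketch of that request loop under stated assumptions; the base URL, API key, model name, and prompt are placeholders, and the repository itself implements this flow in browser-side JavaScript rather than Python.

```python
# Minimal sketch (not the repository's own code): send webcam frames plus a
# text prompt to an OpenAI-compatible vision endpoint serving a local or
# hosted VLM. base_url, api_key, and the model name are placeholder assumptions.
import base64
import time

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def frame_to_data_url(frame) -> str:
    """JPEG-encode an OpenCV frame and wrap it as a base64 data URL."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()


cap = cv2.VideoCapture(0)  # default webcam
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        response = client.chat.completions.create(
            model="placeholder-vlm",  # whatever model the endpoint serves
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this frame in one sentence."},
                    {"type": "image_url",
                     "image_url": {"url": frame_to_data_url(frame)}},
                ],
            }],
        )
        print(response.choices[0].message.content)
        time.sleep(0.5)  # crude pacing between requests
finally:
    cap.release()
```

The same request shape works against hosted APIs or local servers (e.g. an OpenAI-compatible gateway in front of a VLM); only base_url, api_key, and the model name change.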