visual-language-models

There are 45 repositories under the visual-language-models topic.

  • zai-org/CogVLM

    A state-of-the-art open visual language model | Multimodal pretrained model

    Language: Python
  • camel-ai/crab

    🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/

    Language: Python
  • MiniMax-AI/One-RL-to-See-Them-All

    The official repo of "One RL to See Them All: Visual Triple Unified Reinforcement Learning"

    Language: Python
  • bilel-bj/ROSGPT_Vision

    Commanding robots using only language model prompts

    Language: Python
  • xinyanghuang7/Basic-Visual-Language-Model

    Build a simple, basic multimodal large model from scratch. 🤖

    Language: Python
  • BioMedIA-MBZUAI/FetalCLIP

    Official repository of FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

    Language: Python
  • kesimeg/awesome-turkish-language-models

    A curated list of Turkish AI models, datasets, and papers

  • jaisidhsingh/CoN-CLIP

    Implementation of the "Learn No to Say Yes Better" paper.

    Language: Python
  • yangjie-cv/WeThink

    WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

    Language: Python
  • AlignGPT-VL/AlignGPT

    Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"

    Language: Python
  • tianyu-z/VCR

    Official repo for the paper "VCR: Visual Caption Restoration". See arxiv.org/pdf/2406.06462 for details.

    Language: Python
  • Sid2697/HOI-Ref

    Code implementation for the paper "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision"

    Language: Python
  • amathislab/wildclip

    Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models

    Language: Python
  • sduzpf/UAP_VLP

    Universal Adversarial Perturbations for Vision-Language Pre-trained Models

    Language: Python
  • csebuetnlp/IllusionVQA

    This repository contains the data and code of the paper titled "IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models"

    Language: Jupyter Notebook
  • CristianoPatricio/concept-based-interpretability-VLM

    Code for the paper "Towards Concept-based Interpretability of Skin Lesion Diagnosis using Vision-Language Models", ISBI 2024 (Oral).

    Language: Jupyter Notebook
  • Linvyl/DAM-QA

    [ICCVW 2025] Implementation for DAM-QA: Describe Anything Model for Visual Question Answering on Text-rich Images

    Language: Python
  • declare-lab/Sealing

    [NAACL 2024] Official implementation of the paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image-Text Models"

    Language: Python
  • GraphPKU/CoI

    Chain of Images for Intuitively Reasoning

    Language: Python
  • NxtGenLegend/TreeHacks-ZoneOut

    #3 Winner of Best Use of Zoom API at Stanford TreeHacks 2025! An AI-powered meeting assistant that captures video, audio, and textual context from Zoom calls using multimodal RAG.

    Language: JavaScript
  • shreydan/VLM-OD

    Experimental: fine-tune SmolVLM on COCO (without any special <locXYZ> tokens)

    Language: Jupyter Notebook
  • AikyamLab/hallucinogen

    A benchmark for evaluating hallucinations in large visual language models

    Language: Python
  • ArthurBabkin/Parimate

    A Telegram bot for validating audio and video content using CV models, SR models, and VLMs, with deepfake detection leveraging metadata analysis.

    Language: Python
  • kornia/kornia-paligemma

    Rust implementation of Google Paligemma with Candle

    Language: Rust
  • vlvink/PaliGemma-from-scratch

    PaliGemma implemented from scratch, following a YouTube guide, as a learning and demonstration project. It applies the modern development approaches and best practices shown in the original tutorial.

    Language: Python
  • cplou99/FALCONEye

    Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

  • nkkbr/ViCA

    This is the official implementation of ViCA2 (Visuospatial Cognitive Assistant 2), a multimodal large language model designed for advanced visuospatial reasoning. The repository also provides training scripts for the original ViCA model.

    Language: Python
  • K1nght/T2I-ConBench

    T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

    Language: Python
  • ARResearch-1/DiverseAR-Dataset

    Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble

  • laclouis5/uform-coreml-converters

    CLI for converting UForm models to CoreML.

    Language: Python
  • tristandb8/PyTorch-PaliGemma-2

    PyTorch implementation of PaliGemma 2

    Language: Python
  • fullscreen-triangle/pakati

    A specialized tool that provides granular control over AI image generation by enabling region-based prompting, editing, and transformation with metacognitive orchestration.

    Language: Python
  • kornia/kornia-infernum

    👺 Rust inference engine for Visual Language Models

    Language: Rust
  • alessioborgi/RealTime-VLM

    RealTime-VLM brings real-time VLM inference to the browser. It continuously captures webcam frames, sends image+text to an OpenAI-compatible API, and displays responses with sub-second latency. Works with local or hosted VLMs (see the request sketch after this list).

    Language: JavaScript
  • Mr-Wonderfool/Multimodal-Reinforce-CoT

    Fine-tuning Qwen2.5-VL-3B-Instruct with reinforcement learning to produce high-quality chains of thought on the GQA dataset

    Language: Python
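
The RealTime-VLM entry above describes a common pattern among these projects: stream webcam frames, attach a text prompt, and send both to an OpenAI-compatible endpoint backed by a local or hosted VLM. Below is a minimal Python sketch of that request loop under stated assumptions; the base URL, API key, model name, and prompt are placeholders, and the repository itself implements this flow in browser-side JavaScript rather than Python.

```python
# Minimal sketch (not the repository's own code): send webcam frames plus a
# text prompt to an OpenAI-compatible vision endpoint serving a local or
# hosted VLM. base_url, api_key, and the model name are placeholder assumptions.
import base64
import time

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def frame_to_data_url(frame) -> str:
    """JPEG-encode an OpenCV frame and wrap it as a base64 data URL."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()


cap = cv2.VideoCapture(0)  # default webcam
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        response = client.chat.completions.create(
            model="placeholder-vlm",  # whatever model the endpoint serves
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this frame in one sentence."},
                    {"type": "image_url",
                     "image_url": {"url": frame_to_data_url(frame)}},
                ],
            }],
        )
        print(response.choices[0].message.content)
        time.sleep(0.5)  # crude pacing between requests
finally:
    cap.release()
```

The same request shape works against hosted APIs or local servers (e.g. an OpenAI-compatible gateway in front of a VLM); only base_url, api_key, and the model name change.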