vlms

There are 27 repositories under the vlms topic.

  • NanoNets/docext

    An on-premises, OCR-free toolkit for unstructured data extraction, markdown conversion, and benchmarking. (https://idp-leaderboard.org/)

    Language: Python
  • yueliu1999/Awesome-Jailbreak-on-LLMs

    Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on LLMs. It contains papers, code, datasets, evaluations, and analyses.

  • dvlab-research/VisionZip

    Official repository for VisionZip (CVPR 2025)

    Language: Python
  • tianyi-lab/HallusionBench

    [CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

    Language: Python
  • Beckschen/ViTamin

    [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"

    Language: Python
  • MCG-NJU/AWT

    [NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

    Language: Python
  • aim-uofa/SegAgent

    [CVPR 2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

    Language: Python
  • foundation-multimodal-models/CAL

    [NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

    Language: Python
  • mbzuai-oryx/KITAB-Bench

    [ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

    Language: Python
  • video-db/ocr-benchmark

    Benchmarking Vision-Language Models on OCR tasks in Dynamic Video Environments

    Language: Python
  • Mamadou-Keita/VLM-DETECT

    [ICASSP 2024] The official repo for Harnessing the Power of Large Vision Language Models for Synthetic Image Detection

    Language: Python
  • FSoft-AI4Code/VisualCoder

    [NAACL 2025] Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning

    Language: Jupyter Notebook
  • ThomasVonWu/Awesome-VLMs-Strawberry

    A collection of VLM papers, blogs, and projects, with a focus on VLMs for autonomous driving and related reasoning techniques.

  • Imageomics/VLM4Bio

    Code for VLM4Bio, a benchmark dataset of scientific question-answer pairs used to evaluate pretrained VLMs for trait discovery from biological images.

    Language: Python
  • logic-OT/BobVLM

    BobVLM – a 1.5B-parameter multimodal model built from scratch and pre-trained on a single P100 GPU, capable of image description and moderate question answering. 🤗🎉

    Language: Python
  • PGSmall/clip-pgs

    Official code for the CVPR 2025 paper "Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection"

    Language: Python
  • ShenzheZhu/JailDAM

    JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

  • hucebot/words2contact

    Official implementation of "Words2Contact: Identifying Support Contacts from Verbal Instructions Using Foundation Models" (IEEE-RAS Humanoids 2024).

    Language: Python
  • Raymond-Qiancx/Awesome-Multimodal-Machine-Learning-Papers

    A taxonomy and listing of influential studies in advanced multimodal machine learning.

  • SrGrace/generative-ai-compass

    A comprehensive guide to navigating the world of generative artificial intelligence!

  • VectorInstitute/VLDBench

    VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.

    Language: JavaScript
  • werywjw/MultiClimate

    [EMNLP 2024 Workshop NLP4PI] 🌏 MultiClimate: Multimodal Stance Detection on Climate Change Videos 🌎

    Language: Jupyter Notebook
  • yasho191/SwiftAnnotate

    An auto-labelling tool for text, images, and video.

    Language: Python
  • angmavrogiannis/Embodied-Attribute-Detection

    Code for the ICRA 2025 paper: Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs

    Language: Python
  • KT313/assistant_base

    A custom framework for easy use of LLMs, VLMs, etc., supporting various modes and settings via a web UI.

    Language: Jupyter Notebook
  • khurramHashmi/LLaVA-v1.6-Mistral-7b-Finetune-ORPO-RLAIF-V

    Aligns llava-v1.6-mistral-7b on the RLAIF-V dataset using ORPO.

    Language: Python
  • LiAo365/EPSR_VTG

    Language: Python