vlms
There are 27 repositories under the vlms topic.
NanoNets/docext
An on-premises, OCR-free toolkit for unstructured data extraction, markdown conversion, and benchmarking. (https://idp-leaderboard.org/)
yueliu1999/Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel jailbreak methods for LLMs. It contains papers, code, datasets, evaluations, and analyses.
dvlab-research/VisionZip
Official repository for VisionZip (CVPR 2025)
tianyi-lab/HallusionBench
[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Beckschen/ViTamin
[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"
MCG-NJU/AWT
[NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
aim-uofa/SegAgent
[CVPR 2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
foundation-multimodal-models/CAL
[NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
mbzuai-oryx/KITAB-Bench
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
video-db/ocr-benchmark
Benchmarking Vision-Language Models on OCR tasks in Dynamic Video Environments
Mamadou-Keita/VLM-DETECT
[ICASSP 2024] The official repo for Harnessing the Power of Large Vision Language Models for Synthetic Image Detection
FSoft-AI4Code/VisualCoder
[NAACL 2025] Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning
ThomasVonWu/Awesome-VLMs-Strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.
Imageomics/VLM4Bio
Code for VLM4Bio, a benchmark dataset of scientific question-answer pairs used to evaluate pretrained VLMs for trait discovery from biological images.
logic-OT/BobVLM
BobVLM – A 1.5B-parameter multimodal model built from scratch and pre-trained on a single P100 GPU, capable of image description and moderate question answering. 🤗🎉
PGSmall/clip-pgs
Official code for "Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection" (CVPR 2025)
ShenzheZhu/JailDAM
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
hucebot/words2contact
Official implementation of "Words2Contact: Identifying Support Contacts from Verbal Instructions Using Foundation Models" (IEEE-RAS Humanoids 2024).
Raymond-Qiancx/Awesome-Multimodal-Machine-Learning-Papers
A taxonomy and listing of influential recent studies in advanced multimodal machine learning.
SrGrace/generative-ai-compass
A comprehensive guide to navigating the world of generative artificial intelligence!
VectorInstitute/VLDBench
VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.
werywjw/MultiClimate
[EMNLP 2024 Workshop NLP4PI]🌏 MultiClimate: Multimodal Stance Detection on Climate Change Videos 🌎
yasho191/SwiftAnnotate
Auto-labeling tool for text, image, and video data
angmavrogiannis/Embodied-Attribute-Detection
Code for the ICRA 2025 paper: Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
KT313/assistant_base
A custom framework for easy use of LLMs, VLMs, etc., supporting various modes and settings via a web UI
khurramHashmi/LLaVA-v1.6-Mistral-7b-Finetune-ORPO-RLAIF-V
Align llava-v1.6-mistral-7b on the RLAIF-V dataset using ORPO
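The entry above aligns a LLaVA model with ORPO (Odds Ratio Preference Optimization, Hong et al., 2024). As a minimal sketch of the idea, not code from the repo: ORPO adds an odds-ratio preference term to the standard SFT loss, pushing the policy's odds of the chosen response above those of the rejected one. The function below computes that term from average per-token log-probabilities; the inputs and names are illustrative assumptions.

```python
import math

def orpo_odds_ratio_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio preference term from ORPO (illustrative sketch).

    logp_chosen / logp_rejected are the policy model's average per-token
    log-probabilities of the chosen and rejected responses. With
    odds(p) = p / (1 - p), the loss is -log sigmoid(log-odds ratio).
    """
    def log_odds(logp: float) -> float:
        p = math.exp(logp)
        return logp - math.log1p(-p)  # log(p / (1 - p))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
```

In ORPO this term is added, with a weighting coefficient, to the usual cross-entropy loss on the chosen response, so no separate reference model is needed (unlike DPO).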