Pinned Repositories
Ask-Anything
[CVPR 2024 Highlight][VideoChatGPT] ChatGPT with video understanding! Also supports many more LMs, such as MiniGPT-4, StableLM, and MOSS.
DragGAN
Unofficial Implementation of DragGAN - "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold" (full-featured DragGAN implementation with an online demo and local deployment; code and models fully open-sourced; supports Windows, macOS, and Linux)
InternGPT
InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It currently supports DragGAN, ChatGPT, ImageBind, GPT-4-style multimodal chat, SAM, interactive image editing, and more. Try it at igpt.opengvlab.com.
InternImage
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
InternVideo
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o (an open-source multimodal dialogue model approaching GPT-4o's performance)
LLaMA-Adapter
[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
OmniQuant
[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
VideoMamba
[ECCV 2024] VideoMamba: State Space Model for Efficient Video Understanding
VisionLLM
VisionLLM Series
OpenGVLab's Repositories
OpenGVLab/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o (an open-source multimodal dialogue model approaching GPT-4o's performance)
OpenGVLab/InternVideo
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
OpenGVLab/OmniQuant
[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
OpenGVLab/ScaleCUA
ScaleCUA is a family of open-source computer-use agents that operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
OpenGVLab/VideoChat-Flash
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
OpenGVLab/OmniCorpus
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
OpenGVLab/PonderV2
[T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
OpenGVLab/EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
OpenGVLab/VideoChat-R1
[NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
OpenGVLab/EgoVideo
[CVPR 2024 Champions][ICLR 2025] Solutions for the EgoVis Challenges at CVPR 2024
OpenGVLab/GUI-Odyssey
[ICCV 2025] GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. It consists of 8,834 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 212 apps, and 1.4K app combinations.
OpenGVLab/PIIP
[NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)
OpenGVLab/ZeroGUI
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
OpenGVLab/Mono-InternVL
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
OpenGVLab/VeBrain
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
OpenGVLab/NaViL
OpenGVLab/MUTR
[AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
OpenGVLab/EgoExoLearn
[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
OpenGVLab/SDLM
Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-cache compatibility, achieving high efficiency and throughput.
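The SDLM blurb above describes the decoding idea only at a high level. As a rough, hypothetical sketch (not the SDLM code or API), the toy loop below shows block-wise decoding in which only a confident prefix of each proposed block is committed, so the number of tokens emitted per step is adaptive and already-accepted tokens (and hence their KV-cache entries) are never revised. The names propose_block, BLOCK, and TAU are illustrative stand-ins, not identifiers from the repository.

```python
# Conceptual sketch (NOT the SDLM implementation): block-wise decoding where the
# number of tokens committed per step is decided adaptively from per-token
# confidence, so previously accepted tokens' KV-cache entries never need to be
# recomputed. All names below are hypothetical stand-ins.
import random

VOCAB = list("abcdefgh ")   # toy vocabulary
BLOCK = 4                   # tokens proposed per step
TAU = 0.6                   # confidence threshold for committing a token

def propose_block(context, size=BLOCK):
    """Stand-in for the model: return (token, confidence) pairs for one block."""
    # `context` is unused in this stub; a real model would condition on it.
    return [(random.choice(VOCAB), random.random()) for _ in range(size)]

def decode(prompt, max_len=20):
    accepted = list(prompt)          # tokens whose KV cache would be kept
    while len(accepted) < max_len:
        block = propose_block(accepted)
        # Adaptively determine how much of the block to commit: keep the longest
        # prefix whose per-token confidence clears the threshold, but always
        # accept at least one token so decoding makes progress.
        keep = 0
        for _, conf in block:
            if conf >= TAU:
                keep += 1
            else:
                break
        keep = max(keep, 1)
        accepted.extend(tok for tok, _ in block[:keep])
        # Only the newly accepted tokens would be appended to the KV cache here;
        # nothing already accepted is ever invalidated (cache compatibility).
    return "".join(accepted[:max_len])

print(decode("hi "))
```

In SDLM itself the block proposals and confidences presumably come from the model's diffusion-style parallel prediction rather than the random scores used here; the sketch only illustrates the adaptive-length, cache-preserving control flow named in the description.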
OpenGVLab/LORIS
[ICML2023] Long-Term Rhythmic Video Soundtracker
OpenGVLab/TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
OpenGVLab/PVC
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
OpenGVLab/GenExam
GenExam: A Multidisciplinary Text-to-Image Exam
OpenGVLab/MetaCaptioner
OpenGVLab/Docopilot
[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
OpenGVLab/FluxViT
Make Your Training Flexible: Towards Deployment-Efficient Video Models
OpenGVLab/Vlaser
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
OpenGVLab/VRBench
[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
OpenGVLab/SID-VLN
Official implementation of "Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale"
OpenGVLab/ExpVid