vision-and-language
There are 275 repositories under the vision-and-language topic.
aishwaryanr/awesome-generative-ai-guide
A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
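Since LAVIS is a general-purpose library rather than a single model, a minimal sketch may help show how it is typically used. The snippet below follows the loading pattern documented in the LAVIS README; the specific model name ("blip_caption"), model type ("base_coco"), and image path are assumptions for illustration.

```python
# Minimal sketch: image captioning with LAVIS (model/checkpoint names assumed
# from the LAVIS README; swap in any model listed by lavis.models).
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP captioning model together with its matching preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
print(model.generate({"image": image}))
```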
roboflow/maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
om-ai-lab/OmAgent
Build multimodal language agents for fast prototyping and production
salesforce/ALBEF
Code for ALBEF: a new vision-language pre-training method
open-mmlab/Multimodal-GPT
Multimodal-GPT
dandelin/ViLT
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
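ViLT also ships as a port in Hugging Face transformers, which is often the quickest way to try it. The sketch below assumes the dandelin/vilt-b32-finetuned-vqa checkpoint on the Hub and a local image file; it is an illustrative usage pattern, not part of the original repository's training code.

```python
# Minimal sketch: visual question answering with the transformers port of ViLT.
# Checkpoint name and image path are assumptions for illustration.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg").convert("RGB")
question = "How many cats are there?"

# ViLT consumes raw image patches and text tokens jointly, with no CNN backbone
# or region proposals, which is what makes it comparatively lightweight.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```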
om-ai-lab/OmDet
Real-time and accurate open-vocabulary end-to-end object detection
NVlabs/prismer
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
llm-jp/awesome-japanese-llm
Overview of Japanese LLMs
yuewang-cuhk/awesome-vision-language-pretraining-papers
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
rhymes-ai/Aria
Codebase for Aria - an Open Multimodal Native MoE
OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
microsoft/Oscar
Oscar and VinVL
YehLi/xmodaler
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
mbzuai-oryx/groundingLMM
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
InternRobotics/PointLLM
[ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large Language Models to Understand Point Clouds
NVlabs/DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
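Besides the official NVlabs implementation, DoRA is also exposed through Hugging Face PEFT as a flag on LoraConfig. The sketch below assumes a PEFT release that includes the use_dora option (0.9.0 or later) and uses a small causal LM purely as a placeholder base model.

```python
# Minimal sketch: weight-decomposed low-rank adaptation (DoRA) via PEFT.
# Base model, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # decompose weights into magnitude and direction, adapt both
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```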
26hzhang/DL-NLP-Readings
My Reading Lists of Deep Learning and Natural Language Processing
SunzeY/AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
ChenRocks/UNITER
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
SkalskiP/top-cvpr-2025-papers
This repository is a curated collection of the most exciting and influential CVPR 2025 papers. 🔥 [Paper + Code + Demo]
jackroos/VL-BERT
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
SkalskiP/top-cvpr-2024-papers
This repository is a curated collection of the most exciting and influential CVPR 2024 papers. 🔥 [Paper + Code + Demo]
jayleicn/ClipBERT
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
mees/calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
SkalskiP/top-cvpr-2023-papers
This repository is a curated collection of the most exciting and influential CVPR 2023 papers. 🔥 [Paper + Code]
peteanderson80/Matterport3DSimulator
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
vardanagarwal/Proctoring-AI
Software for automatic monitoring in online proctoring
sangminwoo/awesome-vision-and-language
A curated list of awesome vision and language resources (still under construction... stay tuned!)
eric-ai-lab/awesome-vision-language-navigation
A curated list for vision-and-language navigation. ACL 2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions"
zengyan-97/X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
JindongGu/Awesome-Prompting-on-Vision-Language-Model
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
Paranioar/Awesome_Matching_Pretraining_Transfering
A paper list covering large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
google-research-datasets/conceptual-12m
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
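Because Conceptual 12M is distributed as (image-URL, caption) pairs rather than as image files, pre-training pipelines typically stream and decode the images themselves. The sketch below assumes a tab-separated file with one URL/caption pair per line; the cc12m.tsv filename and the downstream pipeline step are illustrative assumptions.

```python
# Minimal sketch: iterating over (image-URL, caption) pairs in CC12M-style TSV.
# File name and column layout (url \t caption) are assumptions for illustration.
import csv
from io import BytesIO

import requests
from PIL import Image

with open("cc12m.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for url, caption in reader:
        try:
            resp = requests.get(url, timeout=10)
            image = Image.open(BytesIO(resp.content)).convert("RGB")
        except Exception:
            continue  # skip dead links, which are common in web-collected data
        # ... feed (image, caption) into a vision-language pre-training pipeline
        print(caption)
        break
```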
j-min/VL-T5
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)