vision-language-learning
There are 10 repositories under the vision-language-learning topic.
AIDC-AI/Ovis
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
shikiw/OPERA
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
RLHF-V/RLAIF-V
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
YunzeMan/Situation3D
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
LooperXX/ManagerTower
Code for ACL 2023 Oral Paper: ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
SHTUPLUS/GITM-MR
The official implementation for the ICCV 2023 paper "Grounded Image Text Matching with Mismatched Relation Reasoning".
yubin1219/CrossVLT
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation (Published in IEEE TMM 2023)
lyuchenyang/Dialogue-to-Video-Retrieval
Code for ECIR 2023 paper "Dialogue-to-Video Retrieval"
abhinav-neil/socratic-models
Socratic models for multimodal reasoning & image captioning
Ravi-Teja-konda/TunedLlavaDelights
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Through Llava fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.