visual-language-learning
There are 14 repositories under the visual-language-learning topic.
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built toward GPT-4V-level capabilities and beyond.
NExT-GPT/NExT-GPT
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
EvolvingLMMs-Lab/Otter
🦦 Otter, a multi-modal model based on OpenFlamingo (an open-source version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning abilities.
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
xiaoachen98/Open-LLaVA-NeXT
An open-source implementation for training LLaVA-NeXT.
RLHF-V/RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
mlpc-ucsd/BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
thomas-yanxin/KarmaVLM
🧘🏻♂️ KarmaVLM (相生): A family of efficient and powerful visual language models.
AdrianBZG/llama-multimodal-vqa
Multimodal Instruction Tuning for Llama 3
xinyanghuang7/Basic-Visual-Language-Model
Build a simple, basic multimodal large model from scratch. 🤖
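As a rough illustration of the "vision encoder + projector + LLM" pattern that such from-scratch multimodal models typically follow (this is a generic sketch, not code from the xinyanghuang7/Basic-Visual-Language-Model repository; all module names and sizes are made-up placeholders):

```python
# Generic sketch: project visual features into the LLM embedding space,
# prepend them to the text tokens, and decode. Placeholder modules stand in
# for a real pretrained vision encoder and LLM.
import torch
import torch.nn as nn

class TinyVisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=2048, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)   # placeholder encoder
        self.projector = nn.Linear(vision_dim, llm_dim)           # maps image features into LLM space
        self.token_embedding = nn.Embedding(vocab_size, llm_dim)  # placeholder LLM embeddings
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_features, input_ids):
        # image_features: (batch, num_patches, vision_dim); input_ids: (batch, seq_len)
        visual = self.projector(self.vision_encoder(image_features))
        text = self.token_embedding(input_ids)
        hidden = self.llm(torch.cat([visual, text], dim=1))
        return self.lm_head(hidden)

model = TinyVisionLanguageModel()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # (1, 24, 32000)
```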
Skyline-9/Shotluck-Holmes
[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale large language-vision models (LLVMs) for shot-level video understanding
ashleykleynhans/llava-docker
Docker image for LLaVA: Large Language and Vision Assistant
MuhammadAliS/CLIP
PyTorch implementation of OpenAI's CLIP model for image classification, visual search, and visual question answering (VQA).
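For context on the image-classification use case mentioned above, here is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face transformers API (not code from the MuhammadAliS/CLIP repository; the checkpoint name, image path, and candidate labels are illustrative assumptions):

```python
# Minimal sketch of CLIP zero-shot image classification with transformers.
# Checkpoint, image path, and labels below are assumptions for illustration.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```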
ecoxial2007/EffVideoQA
Efficient Video Question Answering