vision-language
There are 159 repositories under the vision-language topic.
IDEA-Research/GroundingDINO
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
salesforce/BLIP
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
OFA-Sys/Chinese-CLIP
Chinese version of CLIP, which achieves Chinese cross-modal retrieval and representation generation.
marqo-ai/marqo
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
OFA-Sys/OFA
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
mbzuai-oryx/Video-ChatGPT
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
llm-jp/awesome-japanese-llm
Overview of Japanese LLMs (日本語LLMまとめ)
OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
OpenDriveLab/DriveLM
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
google-research/pix2seq
Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
mbzuai-oryx/LLaVA-pp
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
SunzeY/AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
TinyLLaVA/TinyLLaVA_Factory
A Framework for Small-scale Large Multimodal Models
Algolzw/daclip-uir
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
AILab-CVC/SEED
Official implementation of SEED-LLaMA (ICLR 2024).
longzw1997/Open-GroundingDino
Third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
mees/calvin
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
2U1/Qwen2-VL-Finetune
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
cliport/cliport
CLIPort: What and Where Pathways for Robotic Manipulation
airaria/Visual-Chinese-LLaMA-Alpaca
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
zdou0830/METER
METER: A Multimodal End-to-end TransformER Framework
ChenDelong1999/RemoteCLIP
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
henghuiding/Vision-Language-Transformer
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
HUANGLIZI/LViT
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
WisconsinAIVision/ViP-LLaVA
[CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
movienet/movienet-tools
Tools for movie and video research
zjysteven/lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
TXH-mercury/VAST
[NeurIPS 2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
mertyg/vision-language-models-are-bows
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
metauto-ai/Kaleido-BERT
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
mbzuai-oryx/VideoGPT-plus
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
ALEEEHU/World-Simulator
Watch this repository for the latest updates! 🔥
MarSaKi/VLN-BEVBert
[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
qiantianwen/NuScenes-QA
[AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario.
OatmealLiu/FineR
[ICLR'24] Democratizing Fine-grained Visual Recognition with Large Language Models