vision-language

There are 159 repositories under the vision-language topic.

  • IDEA-Research/GroundingDINO

    [ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

    Language: Python
  • salesforce/BLIP

    PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

    Language: Jupyter Notebook
  • OFA-Sys/Chinese-CLIP

    Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

    Language: Python
  • marqo-ai/marqo

    Unified embedding generation and search engine. Also available as a cloud service at cloud.marqo.ai

    Language: Python
  • OFA-Sys/OFA

    Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

    Language: Python
  • AlibabaResearch/AdvancedLiterateMachinery

    A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

    Language: C++
  • mbzuai-oryx/Video-ChatGPT

    [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

    Language: Python
  • llm-jp/awesome-japanese-llm

    日本語LLMまとめ - Overview of Japanese LLMs

    Language: TypeScript
  • OFA-Sys/ONE-PEACE

    A general representation model across the vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

    Language: Python
  • OpenDriveLab/DriveLM

    [ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering

    Language: HTML
  • google-research/pix2seq

    Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)

    Language: Jupyter Notebook
  • mbzuai-oryx/LLaVA-pp

    🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

    Language: Python
  • SunzeY/AlphaCLIP

    [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Language: Jupyter Notebook
  • TinyLLaVA/TinyLLaVA_Factory

    A Framework of Small-scale Large Multimodal Models

    Language: Python
  • Algolzw/daclip-uir

    [ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.

    Language: Python
  • AILab-CVC/SEED

    Official implementation of SEED-LLaMA (ICLR 2024).

    Language: Python
  • longzw1997/Open-GroundingDino

    A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".

    Language: Python
  • mees/calvin

    CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Language: Python
  • 2U1/Qwen2-VL-Finetune

    An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL model series from Alibaba Cloud.

    Language: Python
  • cliport/cliport

    CLIPort: What and Where Pathways for Robotic Manipulation

    Language: Jupyter Notebook
  • airaria/Visual-Chinese-LLaMA-Alpaca

    Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)

    Language: Python
  • zdou0830/METER

    METER: A Multimodal End-to-end TransformER Framework

    Language: Python
  • ChenDelong1999/RemoteCLIP

    🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)

    Language: Jupyter Notebook
  • henghuiding/Vision-Language-Transformer

    [ICCV 2021 & TPAMI 2023] Vision-Language Transformer and Query Generation for Referring Segmentation

    Language: Python
  • HUANGLIZI/LViT

    [IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

    Language: Python
  • WisconsinAIVision/ViP-LLaVA

    [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

    Language: Python
  • movienet/movienet-tools

    Tools for movie and video research

    Language: C++
  • zjysteven/lmms-finetune

    A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.

    Language: Python
  • TXH-mercury/VAST

    [NeurIPS 2023] Code and model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

    Language: Jupyter Notebook
  • mertyg/vision-language-models-are-bows

    Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023

    Language: Python
  • metauto-ai/Kaleido-BERT

    💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

    Language: Python
  • mbzuai-oryx/VideoGPT-plus

    Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

    Language: Python
  • ALEEEHU/World-Simulator

    Watch this repository for the latest updates! 🔥

  • MarSaKi/VLN-BEVBert

    [ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"

    Language: Python
  • qiantianwen/NuScenes-QA

    [AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario.

    Language: Python
  • OatmealLiu/FineR

    [ICLR'24] Democratizing Fine-grained Visual Recognition with Large Language Models

    Language: Python
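Many of the repositories above (Chinese-CLIP, RemoteCLIP, Alpha-CLIP, daclip-uir, and other CLIP variants) build on the same contrastive cross-modal retrieval idea: an image encoder and a text encoder map inputs into a shared embedding space, and retrieval ranks candidates by cosine similarity. The following is a minimal sketch of that scoring step only, using random vectors in place of real encoder outputs; the array names and shapes are illustrative and do not reflect any listed repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs. In a real CLIP-style model these
# would come from a pretrained image encoder and text encoder.
image_embeddings = rng.normal(size=(4, 512))  # 4 candidate images
text_embeddings = rng.normal(size=(3, 512))   # 3 query captions

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

img = l2_normalize(image_embeddings)
txt = l2_normalize(text_embeddings)

# CLIP-family text-to-image retrieval: cosine similarity matrix,
# then take the best-scoring image for each caption.
similarity = txt @ img.T              # shape: (3 captions, 4 images)
best_image = similarity.argmax(axis=1)  # one image index per caption
print(similarity.shape, best_image.shape)
```

Everything that differs between the CLIP variants listed here (language, domain, prompt format, masked regions) lives in the encoders; the normalize-then-dot-product ranking shown above is the part they share.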