vision-language

There are 124 repositories under the vision-language topic.

  • IDEA-Research/GroundingDINO

    Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

    Language: Python
  • salesforce/BLIP

    PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

    Language: Jupyter Notebook
  • marqo-ai/marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

    Language: Python
  • OFA-Sys/Chinese-CLIP

    A Chinese version of CLIP for Chinese cross-modal retrieval and representation generation.

    Language: Python
  • OFA-Sys/OFA

    Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

    Language: Python
  • AlibabaResearch/AdvancedLiterateMachinery

    A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

    Language: C++
  • mbzuai-oryx/Video-ChatGPT

    [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

    Language: Python
  • OFA-Sys/ONE-PEACE

    A general representation model across the vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

    Language: Python
  • google-research/pix2seq

    Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)

    Language: Jupyter Notebook
  • llm-jp/awesome-japanese-llm

    Overview of Japanese LLMs (日本語LLMまとめ)

  • mbzuai-oryx/LLaVA-pp

    🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

    Language: Python
  • OpenDriveLab/DriveLM

    DriveLM: Driving with Graph Visual Question Answering

    Language: HTML
  • Algolzw/daclip-uir

    [ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.

    Language: Python
  • SunzeY/AlphaCLIP

    [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Language: Jupyter Notebook
  • AILab-CVC/SEED

    Official implementation of SEED-LLaMA (ICLR 2024).

    Language: Python
  • cliport/cliport

    CLIPort: What and Where Pathways for Robotic Manipulation

    Language: Jupyter Notebook
  • airaria/Visual-Chinese-LLaMA-Alpaca

    Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)

    Language: Python
  • TinyLLaVA/TinyLLaVA_Factory

    A framework for small-scale large multimodal models

    Language: Python
  • henghuiding/Vision-Language-Transformer

    [ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation

    Language: Python
  • mees/calvin

    CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Language: Python
  • mczhuge/Kaleido-BERT

    💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

    Language: Python
  • longzw1997/Open-GroundingDino

    A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".

    Language: Python
  • HUANGLIZI/LViT

    [IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

    Language: Python
  • movienet/movienet-tools

    Tools for movie and video research

    Language: C++
  • ChenDelong1999/RemoteCLIP

    🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)

    Language: Jupyter Notebook
  • mertyg/vision-language-models-are-bows

    Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023

    Language: Python
  • TXH-mercury/VAST

    Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

    Language: Jupyter Notebook
  • WisconsinAIVision/ViP-LLaVA

    [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

    Language: Python
  • MarSaKi/VLN-BEVBert

    [ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"

    Language: Python
  • woodfrog/vse_infty

    Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021

    Language: Python
  • qiantianwen/NuScenes-QA

    [AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario.

  • howard-hou/BagFormer

    PyTorch code for BagFormer: Better Cross-Modal Retrieval via bag-wise interaction

    Language: Python
  • MikeWangWZHL/VidIL

    PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"

    Language: Python
  • amazon-science/mix-generation

    MixGen: A New Multi-Modal Data Augmentation

    Language: Python
  • doc-doc/NExT-QA

    NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)

    Language: Python
  • astra-vision/PODA

    [ICCV 2023] Official implementation of "PØDA: Prompt-driven Zero-shot Domain Adaptation"

    Language: Python