Awesome CLIP

This repo collects research resources based on CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue.

CLIP

  • Learning Transferable Visual Models From Natural Language Supervision [paper][code]
  • CLIP: Connecting Text and Images [blog]
  • Multimodal Neurons in Artificial Neural Networks [blog]
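
The core idea behind CLIP's zero-shot transfer can be sketched in a few lines: encode an image and a set of candidate captions into a shared space, then pick the caption whose embedding is most similar to the image's. The sketch below uses toy random vectors in place of CLIP's encoder outputs, so only the scoring logic is real:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score candidate captions against one image by cosine similarity.

    image_emb: (d,) image embedding; text_embs: (n, d), one row per caption.
    Both are L2-normalized first, as CLIP does before comparing modalities.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature  # scaled cosine similarities
    probs = np.exp(logits - logits.max())         # stable softmax
    probs /= probs.sum()
    return probs

# Toy embeddings standing in for CLIP encoder outputs.
rng = np.random.default_rng(0)
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = rng.normal(size=(3, 8))
image_emb = text_embs[1] + 0.1 * rng.normal(size=8)  # close to the "cat" caption
probs = zero_shot_classify(image_emb, text_embs)
print(captions[int(np.argmax(probs))])  # → a photo of a cat
```

In practice the embeddings would come from a pretrained CLIP image and text encoder; everything downstream of the encoders is exactly this similarity-plus-softmax step.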

Training

  • OpenCLIP (3rd-party, PyTorch) [code]
  • Train-CLIP (3rd-party, PyTorch) [code]
  • Paddle-CLIP (3rd-party, PaddlePaddle) [code]
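
The objective shared by these training implementations is the symmetric InfoNCE loss from the CLIP paper: within a batch of matched image-text pairs, each image should be most similar to its own caption and vice versa. A minimal NumPy sketch on toy embeddings (a real trainer would backpropagate this loss through both encoders):

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched
    image/text embedding pairs, following the pseudocode in the CLIP paper.
    Inputs here are toy arrays, not real encoder outputs."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (n, n) similarity matrix
    labels = np.arange(len(logits))          # i-th image matches i-th text

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(1)
pairs = rng.normal(size=(4, 16))
# Well-aligned pairs (images are noisy copies of their texts) give a low loss.
loss = clip_contrastive_loss(pairs + 0.05 * rng.normal(size=(4, 16)), pairs)
print("loss:", float(loss))
```

Shuffling the pairing (so image i no longer matches text i) drives the same loss up sharply, which is the signal that aligns the two modalities during pre-training.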

Applications

GAN

  • VQGAN-CLIP [code]
  • StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [paper][code]
  • CLIP Guided Diffusion [code]
  • CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions [paper]
  • TargetCLIP: Image-Based CLIP-Guided Essence Transfer [paper][code]
  • DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation [paper][code]
  • clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [paper][code]

Object Detection

  • Roboflow Zero-shot Object Tracking [code]
  • Zero-Shot Detection via Vision and Language Knowledge Distillation [paper][code]
  • Crop-CLIP [code]
  • Detic: Detecting Twenty-thousand Classes using Image-level Supervision [paper][code]
  • CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [paper]
  • SLIP: Self-supervision meets Language-Image Pre-training [paper][code]
  • ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [paper][code]

Information Retrieval

  • Unsplash Image Search [code]
  • CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [paper][code]
  • Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [paper][code]
  • Natural Language YouTube Search [code]
  • CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP [doc][code]
  • clip-retrieval [code]
  • A CLIP-Hitchhiker’s Guide to Long Video Retrieval [paper][code]
  • CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [paper][code]
  • X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval [paper][code]
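
The retrieval tools above (e.g. clip-retrieval, Unsplash Image Search) all reduce to the same pattern: embed the corpus once, embed the query, and rank by cosine similarity. A sketch with toy embeddings standing in for precomputed CLIP vectors:

```python
import numpy as np

def retrieve(query_emb, index_embs, k=2):
    """Return indices of the k indexed items most similar to the query,
    by cosine similarity. This is the nearest-neighbour core of CLIP-based
    retrieval; production systems swap the brute-force scan for an ANN index."""
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    return np.argsort(-(idx @ q))[:k]

rng = np.random.default_rng(2)
index = rng.normal(size=(5, 8))   # e.g. precomputed image embeddings
query = index[3] * 2.0            # a query aligned with item 3
top = retrieve(query, index, k=2)
print(int(top[0]))  # → 3
```

Because image and text land in the same space, the query embedding can come from either encoder, which is what makes text-to-image and image-to-image search interchangeable here.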

Representation Learning

  • Wav2CLIP: Learning Robust Audio Representations From CLIP [code]
  • CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation [paper]
  • RegionCLIP: Region-based Language-Image Pretraining [paper][code]
  • CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification [paper]
  • DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [paper][code]
  • CyCLIP: Cyclic Contrastive Language-Image Pretraining [paper][code]
  • CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [paper][code]
  • DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [paper][code]
  • UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [paper]
  • SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model [paper][code]
  • Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [paper][code]
  • PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [paper]
  • Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [paper][code]

Text-to-3D Generation

  • CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation [paper]
  • Text2Mesh: Text-Driven Neural Stylization for Meshes [paper][code]
  • CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders [paper]
  • CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [paper][code]
  • MotionCLIP: Exposing Human Motion Generation to CLIP Space [paper][code]
  • AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars [paper][code]

Text-to-Image Generation

  • Big Sleep: A simple command line tool for text to image generation [code]
  • Deep Daze: A simple command line tool for text to image generation [code]
  • CLIP-CLOP: CLIP-Guided Collage and Photomontage [paper][code]
  • CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [paper][code]

Prompt Learning

  • Learning to Prompt for Vision-Language Models [paper][code]
  • Conditional Prompt Learning for Vision-Language Models [paper][code]
  • Prompt-aligned Gradient for Prompt Tuning [paper][code]
  • CLIP-Adapter: Better Vision-Language Models with Feature Adapters [paper][code]
  • Learning to Compose Soft Prompts for Compositional Zero-Shot Learning [paper][code]
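
The baseline these prompt-learning papers improve on is hand-crafted prompt ensembling from the CLIP paper: embed several templates for each class name, then average and re-normalize. The sketch below uses a hash-seeded toy encoder (not a real model) so only the ensembling recipe is real:

```python
import hashlib
import numpy as np

TEMPLATES = ["a photo of a {}.", "a drawing of a {}.", "an origami {}."]

def toy_encode(text):
    """Stand-in text encoder: a deterministic random vector per string.
    A real setup would call CLIP's text encoder here."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).normal(size=8)

def class_embedding(classname, encode=toy_encode, templates=TEMPLATES):
    """Prompt ensembling as described in the CLIP paper: embed several
    hand-written templates for one class, normalize each, average, and
    re-normalize the mean into a single classifier weight vector."""
    embs = np.stack([encode(t.format(classname)) for t in templates])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

emb = class_embedding("dog")
print(emb.shape)  # → (8,)
```

Methods like CoOp replace the fixed template strings with learned continuous vectors, but the rest of the pipeline (one averaged text embedding per class, compared against image embeddings) stays the same.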

Video Understanding

  • VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [code]
  • FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks [paper][code]
  • Frozen CLIP Models are Efficient Video Learners [paper][code]
  • Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization [paper][code]
  • MovieCLIP: Visual Scene Recognition in Movies [paper]

Image Captioning

  • CLIP prefix captioning [code]
  • CLIPScore: A Reference-free Evaluation Metric for Image Captioning [paper]
  • ClipCap: CLIP Prefix for Image Captioning [paper][code]
  • Text-Only Training for Image Captioning using Noise-Injected CLIP [paper][code]
  • Fine-grained Image Captioning with CLIP Reward [paper][code]
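
CLIPScore, listed above, is simple enough to state inline: it is a rescaled, clipped cosine similarity between the image and candidate caption embeddings, with no reference captions needed. A sketch using toy vectors in place of CLIP encoder outputs, with the w = 2.5 rescaling factor from the paper:

```python
import numpy as np

def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore: w * max(cos(image, caption), 0), with w = 2.5 as in the
    paper. Embeddings here are toy vectors standing in for CLIP outputs."""
    cos = image_emb @ caption_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(caption_emb))
    return w * max(cos, 0.0)

v = np.array([1.0, 0.0, 0.0])
print(clip_score(v, v))   # identical embeddings → maximum score 2.5
print(clip_score(v, -v))  # opposed embeddings are clipped to 0.0
```

The clipping at zero discards anti-correlated similarities, and the rescaling just stretches typical CLIP cosine values into a more readable range.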

Image Editing

  • HairCLIP: Design Your Hair by Text and Reference Image [code]
  • CLIPstyler: Image Style Transfer with a Single Text Condition [paper][code]
  • CLIPasso: Semantically-Aware Object Sketching [paper][code]
  • Image-based CLIP-Guided Essence Transfer [paper][code]
  • CLIPDraw: Synthesize drawings to match a text prompt! [paper][code]
  • CLIP-CLOP: CLIP-Guided Collage and Photomontage [paper][code]
  • Towards Counterfactual Image Manipulation via CLIP [paper][code]

Image Segmentation

  • CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation [paper][code]
  • Image Segmentation Using Text and Image Prompts [paper][code]
  • Extract Free Dense Labels from CLIP [paper][code]
  • Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [paper][code]

3D Recognition

  • PointCLIP: Point Cloud Understanding by CLIP [paper][code]
  • CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [paper][code]
  • MotionCLIP: Exposing Human Motion Generation to CLIP Space [paper][code]

Language Tasks

  • CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [paper]

Object Navigation

  • CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration [paper]

Audio

  • AudioCLIP: Extending CLIP to Image, Text and Audio [code]
  • Wav2CLIP: Learning Robust Audio Representations From CLIP [paper][code]
  • AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [paper]

Localization

  • Adapting CLIP For Phrase Localization Without Further Training [paper][code]

Others

  • Multilingual-CLIP [code]
  • CLIP (With Haiku + Jax!) [code]
  • CLIP-Event: Connecting Text and Images with Event Structures [paper][code]
  • How Much Can CLIP Benefit Vision-and-Language Tasks? [paper]
  • DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [paper][code]
  • Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [paper][code]
  • CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [paper][code]
  • CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [paper][code]

Acknowledgment

Inspired by Awesome Visual-Transformer.