Awesome-self-supervised-multimodal-learning


A curated list of awesome self-supervised multimodal learning resources. Check our survey paper for details!

@article{zong2024self,
  title={Self-Supervised Multimodal Learning: A Survey},
  author={Zong, Yongshuo and Mac Aodha, Oisin and Hospedales, Timothy},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}


Overview

Taxonomy: Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusion of different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment.

Learning Paradigms: An example of self-supervised vision-and-language pretraining followed by downstream supervised learning for visual question answering. (a) Supervised multimodal learning. (b) Self-supervised multimodal learning: top, self-supervised pretraining without manual annotations; bottom, supervised fine-tuning or linear readout for downstream tasks.

Related Survey Papers

  • Multimodal machine learning: A survey and taxonomy.

  • Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions.

  • Deep multimodal learning: A survey on recent advances and trends.

    • IEEE Signal Processing Magazine 2017 [paper]
  • Multimodal research in vision and language: A review of current and emerging trends.

  • Self-Supervised Representation Learning: Introduction, advances, and challenges.

    • IEEE Signal Processing Magazine 2022 [paper]
  • Self-supervised learning: Generative or contrastive.

  • Self-supervised visual feature learning with deep neural networks: A survey.

  • Vision-language pre-training: Basics, recent advances, and future trends.

Objectives

Instance Discrimination

In the context of multimodal learning, instance discrimination often aims to determine whether samples from two input modalities are from the same instance, i.e., paired. By doing so, it attempts to align the representation space of the paired modalities while pushing the representation space of different instance pairs further apart. There are two types of instance discrimination objectives: contrastive and matching prediction, depending on how the input is sampled.
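
The contrastive variant of this objective can be sketched as below. This is a minimal, dependency-free illustration of a symmetric InfoNCE-style loss over a batch of paired embeddings, not the exact loss of any paper in this list. The function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits, using a numerically stable log-sum-exp."""
    largest = max(logits)
    log_partition = largest + math.log(sum(math.exp(l - largest) for l in logits))
    return log_partition - logits[target_index]


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of emb_a is paired with row i of emb_b.

    Paired samples are pulled together; every other pairing in the batch
    is treated as a negative. The temperature is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # sim[i][j]: scaled cosine similarity between sample i (modality A) and j (modality B)
    sim = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
           for i in range(n)]
    loss_a_to_b = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    loss_b_to_a = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                      for j in range(n)) / n
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Toy embeddings (hypothetical): two "images" and two "captions".
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # caption i matches image i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # captions are mismatched

loss_aligned = symmetric_contrastive_loss(image_emb, text_aligned)
loss_swapped = symmetric_contrastive_loss(image_emb, text_swapped)
print(loss_aligned < loss_swapped)
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that aligns the two representation spaces. Matching prediction would instead feed a sampled pair to a binary classifier that predicts whether the pair is matched.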

  • Learning transferable visual models from natural language supervision.

  • Self-supervised multimodal versatile networks.

  • End-to-end learning of visual representations from uncurated instructional videos.

  • Scaling up visual and vision-language representation learning with noisy text supervision.

  • Contrastive Multiview Coding.

  • AudioCLIP: Extending CLIP to Image, Text and Audio.

  • VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding.

  • CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.

  • PointCLIP: Point Cloud Understanding by CLIP.

  • Image-and-Language Understanding from Pixels Only.

  • Scaling Language-Image Pre-training via Masking.

  • COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation.

  • SLIP: Self-supervision meets language-image pre-training.

  • CrossCLR: Cross-modal contrastive learning for multi-modal video representations.

  • CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding.

  • Learnable PINs: Cross-Modal Embeddings for Person Identity.

  • VATT: Transformers for multimodal self-supervised learning from raw video, audio and text.

  • Learning Video Representations using Contrastive Bidirectional Transformer.

  • Learning representations from audio-visual spatial alignment.

  • Sound Localization by Self-Supervised Time Delay Estimation.

  • Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.

  • Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings.

  • Fine-grained Multi-Modal Self-Supervised Learning.

  • Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences.

  • Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization.

  • Audio-Visual Instance Discrimination with Cross-Modal Agreement.

  • Look, Listen and Learn.

  • Objects that Sound.

  • Audio-Visual Scene Analysis with Self-Supervised Multisensory Features.

  • The Sound of Pixels.

  • The Sound of Motions.

  • Music Gesture for Visual Sound Separation.

  • Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning.

Clustering

Clustering methods assume that applying end-to-end trained clustering will lead to the grouping of the data by semantically salient characteristics. In practice, these methods iteratively predict the cluster assignments of the encoded representation, and use these predictions, also known as pseudo labels, as supervision signals to update the feature representation. Multimodal clustering provides the opportunity to learn multimodal representations and also improve conventional clustering by using each modality’s pseudolabels to supervise the other.
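
The cross-modal pseudo-labelling loop described above can be sketched as a toy example. It makes strong simplifying assumptions: fixed feature vectors stand in for encoder outputs, and nearest-centroid updates stand in for the gradient updates a real method would apply to its encoders. All names and data below are hypothetical.

```python
def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def assign_clusters(features, centroids):
    """Pseudo-labels: index of the nearest centroid for each feature vector."""
    return [min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
            for f in features]


def update_centroids(features, labels, centroids):
    """Move each centroid to the mean of the features carrying its label."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, label in zip(features, labels) if label == k]
        if not members:
            updated.append(list(old))  # keep an empty cluster where it was
        else:
            updated.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return updated


def cross_modal_clustering(feats_a, feats_b, cents_a, cents_b, steps=5):
    """Each modality is updated with the pseudo-labels predicted by the other one."""
    for _ in range(steps):
        labels_a = assign_clusters(feats_a, cents_a)
        labels_b = assign_clusters(feats_b, cents_b)
        # Swap supervision: A learns from B's pseudo-labels and vice versa.
        cents_a = update_centroids(feats_a, labels_b, cents_a)
        cents_b = update_centroids(feats_b, labels_a, cents_b)
    return assign_clusters(feats_a, cents_a), assign_clusters(feats_b, cents_b)


# Toy paired data (hypothetical): sample i in "audio" corresponds to sample i in "video".
audio = [[0.0], [0.2], [5.0], [5.2]]
video = [[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]]

audio_labels, video_labels = cross_modal_clustering(
    audio, video, cents_a=[[0.0], [5.0]], cents_b=[[1.0, 1.0], [-1.0, -1.0]]
)
print(audio_labels, video_labels)
```

On this toy data both modalities converge to the same grouping of the paired samples. Real methods typically add a mechanism, such as balanced assignment, to avoid collapsing all samples into one cluster; that is omitted here.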

  • Self-Supervised Learning by Cross-Modal Audio-Video Clustering.

  • Labelling unlabelled videos from scratch with multi-modal self-supervision.

  • Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.

  • u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality.

  • Deep Multimodal Clustering for Unsupervised Audiovisual Learning.

  • Self-labelling via simultaneous clustering and representation learning.

Masked Prediction

The masked prediction task can be performed in either an auto-encoding manner (similar to BERT) or an auto-regressive manner (similar to GPT).
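
The difference between the two styles lies in how inputs and prediction targets are built from a token sequence. The sketch below illustrates this on a mixed sequence of hypothetical image-patch tokens and words. The token names, the `[MASK]` symbol, and the masking probability are illustrative assumptions, and no model is trained.

```python
import random

MASK = "[MASK]"


def autoencoding_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style: hide random positions and predict them from the full context."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)   # the model must reconstruct this token
        else:
            inputs.append(token)
            targets.append(None)    # no loss at unmasked positions
    return inputs, targets


def autoregressive_examples(tokens):
    """GPT-style: predict each token from the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical multimodal sequence: discrete image-patch tokens followed by text.
sequence = ["<img_17>", "<img_4>", "<img_92>", "a", "dog", "runs"]

ae_inputs, ae_targets = autoencoding_example(sequence, mask_prob=0.5)
ar_pairs = autoregressive_examples(sequence)

print(ae_inputs)
print(ar_pairs[-1])
```

In the auto-encoding case, a masked word can be predicted from both the image tokens and the surrounding text. In the auto-regressive case, each token only sees what comes before it, so here the text is generated conditioned on the image prefix.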

  • VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning.

  • CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations.

  • Jointly Learning Visual and Auditory Speech Representations from Raw Data.

  • Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks.

  • SimVLM: Simple Visual Language Model Pretraining with Weak Supervision.

  • VideoBERT: A Joint Model for Video and Language Representation Learning.

  • Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks.

  • VL-BEiT: Generative Vision-Language Pretraining.

  • OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation.

  • SelfDoc: Self-Supervised Document Representation Learning.

  • Deep Bidirectional Language-Knowledge Graph Pretraining.

  • ERNIE: Enhanced Language Representation with Informative Entities.

  • VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding.

  • Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions.

Hybrid

  • Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.

  • DM2C: Deep Mixed-Modal Clustering.

  • Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.

  • HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training.

  • UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation.

  • ActBERT: Learning Global-Local Video-Text Representations.

  • MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound.

  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.

  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision.

  • UNITER: UNiversal Image-TExt Representation Learning.

  • VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.

  • FLAVA: A Foundational Language And Vision Alignment Model.

  • VLMixer: Unpaired vision-language pre-training via cross-modal CutMix.

  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.

  • Unsupervised Vision-and-Language Pretraining via Retrieval-based Multi-Granular Alignment.

  • UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning.

  • Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs.

  • Multimodal Deep Autoencoder for Human Pose Recovery.

  • Self-supervised object detection from audio-visual correspondence.

  • Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos.

  • Self-Supervised Learning of Audio-Visual Objects from Video.

  • COOT: Cooperative hierarchical transformer for video-text representation learning.

  • Unpaired Image Captioning via Scene Graph Alignments.

Applications

State Representation Learning

  • State representation learning for control: An overview.

  • Unsupervised Representation Learning in Deep Reinforcement Learning: A Review.

  • Action-Conditional Video Prediction using Deep Networks in Atari Games.

  • Recurrent World Models Facilitate Policy Evolution.

  • Learning latent dynamics for planning from pixels.

  • Learning to Poke by Poking: Experiential Learning of Intuitive Physics.

  • Learning Predictive Representations for Deformable Objects Using Contrastive Estimation.

Healthcare

  • Multimodal biomedical AI.

  • MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.

  • ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics.

  • CoMIR: Contrastive multimodal image representation for registration.

  • Contrastive learning of medical visual representations from paired images and text.

  • GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition.

  • Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning.

  • Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports.

Remote Sensing

  • Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model.

    • ISPRS Journal of Photogrammetry and Remote Sensing 2021 [paper] [code]
  • Self-Supervised SAR-Optical Data Fusion of Sentinel-1/-2 Images.

    • IEEE Transactions on Geoscience and Remote Sensing 2022 [paper]
  • Semi-Supervised Learning for Joint SAR and Multispectral Land Cover Classification.

    • IEEE Geoscience and Remote Sensing Letters 2021 [paper]
  • Self-Supervised Change Detection in Multiview Remote Sensing Images.

    • IEEE Transactions on Geoscience and Remote Sensing 2021 [paper] [code]
  • Self-Supervised Multisensor Change Detection.

    • IEEE Transactions on Geoscience and Remote Sensing 2021 [paper] [code]
  • Self-supervised Audiovisual Representation Learning for Remote Sensing Data.

Machine Translation

  • A Survey of Multilingual Neural Machine Translation.

    • ACM Computing Surveys 2019 [paper]
  • Unsupervised Machine Translation Using Monolingual Corpora Only.

  • Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation.

  • Visual Grounding in Video for Unsupervised Word Translation.

  • Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders.

  • The Missing Ingredient in Zero-Shot Neural Machine Translation.

Autonomous Driving

  • Multi-modal Sensor Fusion for Auto Driving Perception: A Survey.

  • Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data.

  • Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR.

  • There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge.

  • Unsupervised Learning of Depth, Optical Flow and Pose with Occlusion from 3D Geometry.

Robotics

  • Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks.

  • Self-Supervised Visual Terrain Classification From Unsupervised Acoustic Feature Learning.

    • IEEE Transactions on Robotics 2019 [paper]
  • Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks.

  • Two stream networks for self-supervised ego-motion estimation.

  • Connecting Touch and Vision via Cross-Modal Prediction.

Challenges

Resources

  • Contrastive Vision-Language Pre-training with Limited Resources.

  • Beyond neural scaling laws: beating power law scaling via data pruning.

  • Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Robustness/Fairness

  • When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?

  • Robustness Analysis of Video-Language Models Against Visual and Language Perturbations.

  • Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty.

  • BadEncoder: Backdoor attacks to pre-trained encoders in self-supervised learning.

    • IEEE Symposium on Security and Privacy (SP) 2022 [paper] [code]
  • Are Multimodal Models Robust to Image and Text Perturbations?

  • A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems.

  • Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications.

  • Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning.

  • Multimodal datasets: misogyny, pornography, and malignant stereotypes.

  • On the opportunities and risks of foundation models.

  • Extracting Training Data from Diffusion Models.

  • Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models.

    • Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) 2022 [paper]
  • Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers.

  • What makes for good views for contrastive learning?

Summary of Common Multimodal Datasets

Image-Text Datasets

| Name | # Images | # Text | Domain | Task | Access | Github |
|---|---|---|---|---|---|---|
| COCO | >330000 | >1.5M | Natural images | image captioning, image-text retrieval | Link | Github |
| Flickr30k | 31,000 | 5 sentences for each image | Natural images | image captioning, image-text retrieval | Link | - |
| FlickrStyle10K | 10,000 | 10,000 | Natural images | image captioning (stylized), image-text retrieval | Link | Github |
| Flickr8k | 8,000 | 5 for each | Natural images | image captioning, image-text retrieval | Link | Github |
| Flickr8k-CN | 8,000 | 8,000 | Natural images | image captioning, image-text retrieval | Link | Github |
| SentiCap | 1671/1500 | 4892/3977 | Natural images | image captioning (stylized), image-text retrieval | Link | - |
| SBU Captions | 1M | 1M | Natural images | image captioning, image-text retrieval | Link | Link |
| Conceptual Captions | 3M | 3M | Natural images | image captioning, image-text retrieval | Link | Github |
| AIC-ICC | 210K | 210K | Natural images | image captioning, image-text retrieval | Link | Github |
| Wikipedia | 2866 | 2866 | document | image captioning, image-text retrieval | ? | Github |
| NUS-WIDE-10K | 10K | 10K | Natural images | image captioning, image-text retrieval | Link | - |
| Yelp | 200,100 | 6,990,280 | product review | summarization | Link | - |
| VQA v2.0 | 204,721 | 1105904/11,059,040 (Q/A) | Natural images | VQA | Link | - |
| ImageCLEF 2019 VQA-Med | 3825 | 3825 | Medicine | VQA | Link | Github |
| VCR | 110k | 290k/290k/290k (Q/A/Rationale) | natural | visual commonsense reasoning (VCR) | Link | Github |
| GD-VCR | 328 | 886/886 (Q/A) | Geo-Diverse | visual commonsense reasoning (VCR) | Link | Github |
| SNLI-VE | Details | | Natural images | Visual Entailment | Link | Github |
| NLVR2 | 107,292 | 107,292 | Natural images | natural language for visual reasoning | Link | Github |
| NLVR | 92244 | 92244 | synthetic images | natural language for visual reasoning | Link | Github |
| rendered SST2 | ~1k | ~1k | image of text | optical character recognition (OCR) | Link | - |
| OCR-CC | 1.4M | 1.4M | Natural images | optical character recognition (OCR) | Link | Github |
| Hateful Memes | 10k+ | 10k+ | memes | optical character recognition (OCR) | Link | Github |
| CORD | 1K | 1k | document | OCR | Link | Github |
| RefCOCO+ | 19,992 | 141,564 | Natural images | Visual Grounding | Link | Github |

Image-Text-Audio Datasets

| Name | # Images | # Text | Domain | Task | Access | Github |
|---|---|---|---|---|---|---|
| Localized Narratives | 848,749 | 873,107 | natural | Image captioning, paragraph generation, VQA, phrase grounding, etc. | Link | Github |
| Open Images | 0.6M | 0.6M | natural | Image captioning, detection, segmentation, VQA, etc. | Link | Github |

Video-Text Datasets

| Name | # Videos : # Clips | # Text | Domain | Task | Access | Github |
|---|---|---|---|---|---|---|
| ActivityNet Captions | 20k:100k | 100k | natural | Video captioning, video-text retrieval | Link | Github |
| V2C | 9k | 27k | natural (human action) | Video captioning, video-text retrieval | Link | Github |
| VATEX | 41.3k | 826k | natural | Video captioning, video-text retrieval | Link | Github |
| YouCook2 | 2k:15.4k | 15.4k | Cooking | Video captioning, video-text retrieval | Link | Github |
| Charades | 10k:10k | 27.8k | Indoor activity | Video captioning, video-text retrieval, action recognition | Link | - |
| MSR-VTT | 7k:10k | 200k | natural | Video captioning, video-text retrieval | Link | Github |
| MSVD | 2k:2k | 70k | natural | Video captioning, video-text retrieval | - | - |
| HowTo100M | 1.2M:136M | 136M | instructional video | Video captioning, video-text retrieval, action localization | Link | - |
| TGIF | 102k:102k | 126k | animated GIFs | Video captioning, video-text retrieval | Link | Github |
| TACoS-MLevel | 185:25k | 75k | Cooking | Video captioning, video-text retrieval | Link | - |
| CrossTask | 4.7K:- | 4.7k | instructional | Temporal action localization | Link | Github |
| MiningYoutube | 20k:200k | 200k | Cooking | Temporal action localization | Link | Github |
| COIN | 11,827:- | 46,354 | 12 different domains | Action segmentation | Link | Github |
| Breakfast | -:11267 | 11267 | cooking | Action segmentation | Link | - |
| LSMDC | 200:128k | 128k | Movie | Video captioning, video-text retrieval | Link | - |
| HOMAGE | 1.75K | | Indoor activity | Activity classification | Link | Github |

Video-Audio Datasets

| Name | # Video-audio | # Utterances | Domain | Task | Access | Github |
|---|---|---|---|---|---|---|
| SoundNet | 2M | | natural | Audio-visual correspondence | Link | Github |
| MUSIC | 714 | | music instruments | Audio-visual correspondence | Link | Github |
| AVSpeech | 290k | | Person | Audio-visual correspondence | Link | Github |
| URMP | 44 | | music instruments | Audio-visual correspondence | Link | - |
| AV-Bench | v1 ~5k, v2 ~7k | | natural | Audio-Visual Correspondence (AVC), Audio-Visual Event Localization (AVEL), Audio-Visual Video Parsing (AVVP), Sound Source Localization (SSL), etc. | Link | Github |
| AVE | 4143 | | natural | temporal localization | Link | Github |
| 360° video | 1146 | | camera | Spatial audio generation | Link | Github |
| openpose | | | person | Audio-visual correspondence, music-to-video generation | Link | Github |
| LRS2 | - | 144481 | person | speech recognition, lip reading | Link | Github |
| LRS3 | 9506 | 151819 | person | speech recognition, lip reading | Link | - |

Point Cloud Datasets

| Name | # Meshes | Domain | Task | Access | Github |
|---|---|---|---|---|---|
| ModelNet40 | 12,311 | CAD models | Classification, reconstruction | Link | Github |
| ShapeNet | 220,000 | 3D models | Classification, reconstruction | Link | - |
| ScanObjectNN | 2902 | real-world point cloud | Classification, reconstruction | Link | Github |

Image-LiDAR Datasets

| Name | # Images | # Points (M) | Domain | Task | Access | Github |
|---|---|---|---|---|---|---|
| Eigen split KITTI | 7481+7518 | 1799 | auto driving | detection | Link | - |
| nuScenes | | | auto driving | 3D detection and tracking | Link | Github |
| SemanticKITTI | 23201+20351 | 4549 | auto driving | segmentation | Link | Github |

Contribute

PRs are welcome. Please use the following markdown format:

- Paper Name. 
  - *Conference Year*. [[paper]](link) [[code]](link)
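
For contributors who generate entries from their own reference manager, a small helper along these lines could produce the format above. The function name, its parameters, and the example URLs are hypothetical and not part of this repository.

```python
def format_entry(title, venue, year, paper_url, code_url=None):
    """Render one list entry in the two-line markdown format requested above."""
    title = title.rstrip(".")  # avoid a doubled period after the title
    links = f"[[paper]]({paper_url})"
    if code_url:
        links += f" [[code]]({code_url})"
    return f"- {title}.\n  - *{venue} {year}*. {links}"


# Placeholder values for illustration only.
entry = format_entry(
    "Example Paper Title",
    "CVPR",
    2023,
    "https://example.com/paper",
    "https://example.com/code",
)
print(entry)
```

The `[[code]]` link is omitted when no code URL is given, matching entries in the list that only link to a paper.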