Awesome-self-supervised-multimodal-learning

A curated list of awesome self-supervised multimodal learning resources. Check our survey paper for details!

@article{zong2024self,
  title={Self-Supervised Multimodal Learning: A Survey},
  author={Zong, Yongshuo and Mac Aodha, Oisin and Hospedales, Timothy},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}

Overview
Related Survey Papers
Objectives
Applications
Challenges
- Resources
- Robustness/Fairness
Summary of Common Multimodal Datasets

Overview

Taxonomy: Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusion of different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment.

Learning Paradigms: An example of self-supervised vision-and-language pretraining followed by downstream supervised learning for visual question answering. (a) Supervised multimodal learning. (b) Self-supervised multimodal learning: top, self-supervised pretraining without manual annotations; bottom, supervised fine-tuning or linear readout for downstream tasks.

Related Survey Papers

Multimodal machine learning: A survey and taxonomy.
- IEEE TPAMI 2018 [paper]
Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions.
- arXiv 2022 [paper]
Deep multimodal learning: A survey on recent advances and trends.
- IEEE Signal Processing Magazine 2017 [paper]
Multimodal research in vision and language: A review of current and emerging trends.
- Information Fusion 2022 [paper]
Self-Supervised Representation Learning: Introduction, advances, and challenges.
- IEEE Signal Processing Magazine 2022 [paper]
Self-supervised learning: Generative or contrastive.
- IEEE TKDE 2021 [paper]
Self-supervised visual feature learning with deep neural networks: A survey.
- IEEE TPAMI 2020 [paper]
Vision-language pre-training: Basics, recent advances, and future trends.
- arXiv 2022 [paper]

Objectives

Instance Discrimination

In the context of multimodal learning, instance discrimination often aims to determine whether samples from two input modalities are from the same instance, i.e., paired. By doing so, it attempts to align the representation space of the paired modalities while pushing the representation space of different instance pairs further apart. There are two types of instance discrimination objectives: contrastive and matching prediction, depending on how the input is sampled.
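
As a rough illustration of the contrastive variant described above, the following sketch computes a symmetric InfoNCE-style loss over a small batch of paired embeddings. It is a minimal pure-Python example under stated assumptions, not the implementation of any paper listed here: the function names, the toy embeddings, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Paired rows (same index) are positives; all other rows are negatives."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity between every item of modality A and every item of B.
    logits_ab = [[dot(x, y) / temperature for y in b] for x in a]
    # Transposing gives the B-to-A direction.
    logits_ba = [list(col) for col in zip(*logits_ab)]
    return 0.5 * (cross_entropy_diagonal(logits_ab)
                  + cross_entropy_diagonal(logits_ba))

# Hypothetical toy embeddings: row i of each list comes from the same instance.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8]]

aligned = symmetric_contrastive_loss(image_emb, text_emb)
shuffled = symmetric_contrastive_loss(image_emb, text_emb[::-1])
print(aligned < shuffled)
```

Correctly paired batches should give a lower loss than mismatched ones, which is the signal that pulls paired representations together and pushes different instances apart.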

Learning transferable visual models from natural language supervision.
- ICML 2021 [paper]
Self-supervised multimodal versatile networks.
- NeurIPS 2020 [paper] [code]
End-to-end learning of visual representations from uncurated instructional videos.
- CVPR 2020 [paper] [code]
Scaling up visual and vision-language representation learning with noisy text supervision.
- ICML 2021 [paper]
Contrastive Multiview Coding.
- ECCV 2020 [paper] [code]
AudioCLIP: Extending CLIP to Image, Text and Audio.
- ICASSP 2022 [paper] [code]
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding.
- EMNLP 2021 [paper] [code]
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.
- Neurocomputing 2021 [paper] [code]
PointCLIP: Point Cloud Understanding by CLIP.
- CVPR 2022 [paper] [code]
Image-and-Language Understanding from Pixels Only.
- arXiv 2022 [paper] [code]
Scaling Language-Image Pre-training via Masking.
- arXiv 2022 [paper]
COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation.
- ICCV 2021 [paper] [code]
SLIP: Self-supervision meets Language-Image Pre-training.
- ECCV 2022 [paper] [code]
CrossCLR: Cross-modal Contrastive Learning for Multi-modal Video Representations.
- ICCV 2021 [paper] [code]
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding.
- CVPR 2022 [paper] [code]
Learnable PINs: Cross-Modal Embeddings for Person Identity.
- ECCV 2018 [paper]
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text.
- NeurIPS 2021 [paper] [code]
Learning Video Representations using Contrastive Bidirectional Transformer.
- arXiv 2019 [paper]
Learning representations from audio-visual spatial alignment.
- NeurIPS 2020 [paper] [code]
Sound Localization by Self-Supervised Time Delay Estimation.
- ECCV 2022 [paper] [code]
Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.
- CVPR 2019 [paper] [code]
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings.
- ICCV 2019 [paper] [code]
Fine-grained Multi-Modal Self-Supervised Learning.
- BMVC 2021 [paper]
Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences.
- CVPR Workshops 2020 [paper]
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization.
- NeurIPS 2018 [paper]
Audio-Visual Instance Discrimination with Cross-Modal Agreement.
- CVPR 2020 [paper] [code]
Look, Listen and Learn.
- ICCV 2017 [paper]
Objects that Sound.
- ECCV 2018 [paper]
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features.
- ECCV 2018 [paper] [code]
The Sound of Pixels.
- ECCV 2018 [paper] [code]
The Sound of Motions.
- ICCV 2019 [paper]
Music Gesture for Visual Sound Separation.
- CVPR 2020 [paper]
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning.
- ACM MM 2020 [paper]

Clustering

Clustering methods assume that end-to-end trained clustering will group the data by semantically salient characteristics. In practice, these methods iteratively predict the cluster assignments of the encoded representations and use these predictions, also known as pseudo-labels, as supervision signals to update the feature representation. Multimodal clustering makes it possible to learn multimodal representations and also to improve conventional clustering by using each modality’s pseudo-labels to supervise the other.
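
The cross-modal pseudo-labelling idea can be sketched as follows. This is a simplified pure-Python illustration assuming nearest-centroid clustering and toy features. All function names, feature values, and centroids are hypothetical, and the encoder update that would normally consume the swapped targets is omitted.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def assign_pseudo_labels(features, centroids):
    return [nearest_centroid(f, centroids) for f in features]

def update_centroids(features, labels, centroids):
    """Recompute each centroid as the mean of its members (kept if empty)."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, label in zip(features, labels) if label == k]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))
    return updated

def cross_modal_targets(audio_feats, video_feats, audio_centroids, video_centroids):
    """Each modality is supervised by the other modality's cluster assignments."""
    audio_labels = assign_pseudo_labels(audio_feats, audio_centroids)
    video_labels = assign_pseudo_labels(video_feats, video_centroids)
    return {"audio_targets": video_labels, "video_targets": audio_labels}

# Hypothetical encoded features for four paired audio-video clips.
audio_feats = [[0.1], [0.2], [0.9], [1.0]]
video_feats = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.4, 0.4]]
audio_centroids = [[0.0], [1.0]]
video_centroids = [[0.0, 0.0], [1.0, 1.0]]

targets = cross_modal_targets(audio_feats, video_feats,
                              audio_centroids, video_centroids)
print(targets)
```

In a full training loop, the targets would serve as classification labels for the opposite modality's encoder, and the centroids would be refreshed with `update_centroids` after each round.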

Self-Supervised Learning by Cross-Modal Audio-Video Clustering.
- NeurIPS 2020 [paper] [code]
Labelling unlabelled videos from scratch with multi-modal self-supervision.
- NeurIPS 2020 [paper] [code]
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.
- ICLR 2022 [paper] [code]
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality.
- NeurIPS 2022 [paper] [code]
Deep Multimodal Clustering for Unsupervised Audiovisual Learning.
- CVPR 2019 [paper] [code]
Self-labelling via simultaneous clustering and representation learning.
- ICLR 2020 [paper] [code]

Masked Prediction

Masked prediction can be performed in either an auto-encoding manner (similar to BERT) or an auto-regressive manner (similar to GPT).
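
To make the difference between the two formulations concrete, the sketch below builds training inputs and targets for each from one multimodal token sequence. It is a minimal illustration, assuming a toy sequence of image-patch and text tokens. The token names, the `[MASK]` symbol, and the masking probability are hypothetical, and no model is trained.

```python
import random

MASK = "[MASK]"

def autoencoding_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style: hide random tokens and predict them from both sides."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # the model must recover the hidden token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss on unmasked positions
    return inputs, targets

def autoregressive_examples(tokens):
    """GPT-style: predict each token from the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical sequence: two image-patch tokens followed by a caption.
sequence = ["img_0", "img_1", "a", "dog", "runs"]

masked_inputs, masked_targets = autoencoding_example(sequence, mask_prob=0.5)
prefix_pairs = autoregressive_examples(sequence)

print(masked_inputs, masked_targets)
print(prefix_pairs[-1])
```

The auto-encoding variant sees context on both sides of each masked position, while the auto-regressive variant only conditions on the prefix, which is what makes it directly usable for generation.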

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning.
- arXiv 2022 [paper] [code]
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations.
- EMNLP 2021 [paper] [code]
Jointly Learning Visual and Auditory Speech Representations from Raw Data.
- ICLR 2023 [paper] [code]
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks.
- arXiv 2022 [paper] [code]
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision.
- ICLR 2022 [paper] [code]
VideoBERT: A Joint Model for Video and Language Representation Learning.
- ICCV 2019 [paper] [code]
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks.
- ICLR 2023 [paper] [code]
VL-BEiT: Generative Vision-Language Pretraining.
- arXiv 2022 [paper] [code]
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation.
- arXiv 2021 [paper] [code]
SelfDoc: Self-Supervised Document Representation Learning.
- CVPR 2021 [paper]
Deep Bidirectional Language-Knowledge Graph Pretraining.
- NeurIPS 2022 [paper] [code]
ERNIE: Enhanced Language Representation with Informative Entities.
- ACL 2019 [paper] [code]
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding.
- ACL 2021 [paper] [code]
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions.
- NAACL 2021 [paper] [code]

Hybrid

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.
- NeurIPS 2021 [paper] [code]
DM2C: Deep Mixed-Modal Clustering.
- NeurIPS 2019 [paper] [code]
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
- ECCV 2020 [paper] [code]
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training.
- EMNLP 2020 [paper] [code]
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation.
- arXiv 2020 [paper] [code]
ActBERT: Learning Global-Local Video-Text Representations.
- CVPR 2020 [paper] [code]
MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound.
- CVPR 2022 [paper] [code]
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.
- ICML 2022 [paper] [code]
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision.
- ICML 2021 [paper] [code]
UNITER: UNiversal Image-TExt Representation Learning.
- ECCV 2020 [paper] [code]
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.
- NeurIPS 2022 [paper] [code]
FLAVA: A Foundational Language And Vision Alignment Model.
- CVPR 2022 [paper] [code]
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix.
- ICML 2022 [paper]
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.
- NeurIPS 2019 [paper] [code]
Unsupervised Vision-and-Language Pretraining via Retrieval-based Multi-Granular Alignment.
- CVPR 2022 [paper] [code]
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning.
- ACL 2021 [paper] [code]
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs.
- TACL 2020 [paper] [code]
Multimodal Deep Autoencoder for Human Pose Recovery.
- IEEE TIP 2015 [paper]
Self-supervised object detection from audio-visual correspondence.
- CVPR 2021 [paper]
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos.
- ICCV 2021 [paper] [code]
Self-Supervised Learning of Audio-Visual Objects from Video.
- ECCV 2020 [paper] [code]
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning.
- NeurIPS 2020 [paper] [code]
Unpaired Image Captioning via Scene Graph Alignments.
- ICCV 2019 [paper] [code]

Applications

State Representation Learning

State representation learning for control: An overview
- Neural Networks 2018 [paper]
Unsupervised Representation Learning in Deep Reinforcement Learning: A Review
- arXiv 2022 [paper]
Action-Conditional Video Prediction using Deep Networks in Atari Games
- NeurIPS 2015 [paper] [code]
Recurrent World Models Facilitate Policy Evolution
- NeurIPS 2018 [paper] [code]
Learning latent dynamics for planning from pixels
- ICML 2019 [paper] [code]
Learning to Poke by Poking: Experiential Learning of Intuitive Physics
- NeurIPS 2016 [paper]
Learning Predictive Representations for Deformable Objects Using Contrastive Estimation
- CoRL 2021 [paper] [code]

Healthcare

Multimodal biomedical AI.
- Nature Medicine 2022 [paper]
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.
- EMNLP 2022 [paper] [code]
ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics.
- CVPR 2022 [paper] [code]
CoMIR: Contrastive multimodal image representation for registration.
- NeurIPS 2020 [paper] [code]
Contrastive learning of medical visual representations from paired images and text.
- arXiv 2020 [paper] [code]
GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition.
- ICCV 2021 [paper] [code]
Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning.
- Nature Biomedical Engineering 2022 [paper] [code]
Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports.
- Nature Machine Intelligence 2022 [paper] [code]

Remote Sensing

Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model.
- ISPRS Journal of Photogrammetry and Remote Sensing 2021 [paper] [code]
Self-Supervised SAR-Optical Data Fusion of Sentinel-1/-2 Images.
- IEEE Transactions on Geoscience and Remote Sensing 2022 [paper]
Semi-Supervised Learning for Joint SAR and Multispectral Land Cover Classification.
- IEEE Geoscience and Remote Sensing Letters 2021 [paper]
Self-Supervised Change Detection in Multiview Remote Sensing Images.
- IEEE Transactions on Geoscience and Remote Sensing 2021 [paper] [code]
Self-Supervised Multisensor Change Detection.
- IEEE Transactions on Geoscience and Remote Sensing 2021 [paper] [code]
Self-supervised Audiovisual Representation Learning for Remote Sensing Data.
- Int. J. Appl. Earth Obs. Geoinformation 2021 [paper] [code]

Machine Translation

A Survey of Multilingual Neural Machine Translation.
- ACM Computing Surveys 2019 [paper]
Unsupervised Machine Translation Using Monolingual Corpora Only.
- ICLR 2018 [paper] [code]
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation.
- TACL 2016 [paper]
Visual Grounding in Video for Unsupervised Word Translation.
- CVPR 2020 [paper] [code]
Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders.
- ACL 2019 [paper]
The Missing Ingredient in Zero-Shot Neural Machine Translation.
- arXiv 2019 [paper]

Auto-driving

Multi-modal Sensor Fusion for Auto Driving Perception: A Survey.
- arXiv 2022 [paper]
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data.
- CVPR 2022 [paper] [code]
Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR.
- CoRL 2021 [paper] [code]
There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge.
- CVPR 2021 [paper] [code]
Unsupervised Learning of Depth, Optical Flow and Pose with Occlusion from 3D Geometry.
- T-ITS 2022 [paper] [code]

Robotics

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks.
- ICRA 2018 [paper]
Self-Supervised Visual Terrain Classification From Unsupervised Acoustic Feature Learning.
- IEEE Transactions on Robotics 2019 [paper]
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks.
- CVPR 2020 [paper]
Two stream networks for self-supervised ego-motion estimation.
- CoRL 2019 [paper]
Connecting Touch and Vision via Cross-Modal Prediction.
- CVPR 2019 [paper]

Challenges

Resources

Contrastive Vision-Language Pre-training with Limited Resources.
- ECCV 2022 [paper] [code]
Beyond neural scaling laws: beating power law scaling via data pruning.
- NeurIPS 2022 [paper] [code]
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.
- ICLR 2022 [paper] [code]

Robustness/Fairness

When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?
- ICLR 2023 [paper] [code]
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations.
- NeurIPS Datasets and Benchmarks Track 2022 [paper] [code]
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty.
- NeurIPS 2019 [paper] [code]
BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning.
- IEEE Symposium on Security and Privacy (SP) 2022 [paper] [code]
Are Multimodal Models Robust to Image and Text Perturbations?
- arXiv 2022 [paper] [code]
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems.
- Interspeech 2022 [paper]
Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications.
- arXiv 2021 [paper]
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning.
- NeurIPS 2022 [paper] [code]
Multimodal datasets: misogyny, pornography, and malignant stereotypes.
- arXiv 2021 [paper]
On the opportunities and risks of foundation models.
- arXiv 2021 [paper]
Extracting Training Data from Diffusion Models.
- arXiv 2023 [paper]
Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models.
- Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) 2022 [paper]
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers.
- TACL 2021 [paper] [code]
What makes for good views for contrastive learning?
- NeurIPS 2020 [paper] [code]

Summary of Common Multimodal Datasets

Image-Text Datasets

Name	# Images	# Text	Domain	Task	Access	Github
COCO	>330000	>1.5M	Natural images	image captioning, image-text retrieval	Link	Github
Flickr30k	31,000	5 sentences for each image	Natural images	image captioning, image-text retrieval	Link	-
FlickrStyle10K	10,000	10,000	Natural images	image captioning (stylized), image-text retrieval	Link	Github
Flickr8k	8,000	5 for each	Natural images	image captioning, image-text retrieval	Link	Github
Flickr8k-CN	8,000	8,000	Natural images	image captioning, image-text retrieval	Link	Github
SentiCap	1671/1500	4892/3977	Natural images	image captioning (stylized), image-text retrieval	Link	-
SBU Captions	1M	1M	Natural images	image captioning, image-text retrieval	Link	Link
Conceptual Captions	3M	3M	Natural images	image captioning, image-text retrieval	Link	Github
AIC-ICC	210K	210K	Natural images	image captioning, image-text retrieval	Link	Github
Wikipedia	2866	2866	document	image captioning, image-text retrieval	?	Github
NUS-WIDE-10K	10K	10K	Natural images	image captioning, image-text retrieval	Link	-
Yelp	200,100	6,990,280	product review	summarization	Link	-
VQA v2.0	204,721	1105904/11,059,040 (Q/A)	Natural images	VQA	Link	-
ImageCLEF 2019 VQA-Med	3825	3825	Medicine	VQA	Link	Github
VCR	110k	290k/290k/290k (Q/A/Rationale)	natural	visual commonsense reasoning (VCR)	Link	Github
GD-VCR	328	886/886(Q/A)	Geo-Diverse	visual commonsense reasoning (VCR)	Link	Github
SNLI-VE	Details		Natural images	Visual Entailment	Link	Github
NLVR2	107,292	107,292	Natural images	natural language for visual reasoning	Link	Github
NLVR	92244	92244	synthetic images	natural language for visual reasoning	Link	Github
rendered SST2	~1k	~1k	image of text	optical character recognition (OCR)	Link	-
OCR-CC	1.4M	1.4M	Natural images	optical character recognition (OCR)	Link	Github
Hateful Memes	10k+	10k+	memes	optical character recognition (OCR)	Link	Github
CORD	1K	1k	document	OCR	Link	Github
RefCOCO+	19,992	141,564	Natural images	Visual Grounding	Link	Github

Image-Text-Audio Datasets

Name	# Images	# Text	Domain	Task	Access	Github
Localized Narratives	848,749	873,107	natural	Image captioning, Paragraph generation, VQA, Phrase grounding etc.	Link	Github
Open Images	0.6M	0.6M	natural	Image captioning, detection, segmentation, VQA, etc.	Link	Github

Video-Text Datasets

Name	# Video / # clips	# Text	Domain	Task	Access	Github
ActivityNet Captions	20k: 100k	100k	natural	Video Captioning, video-text retrieval	Link	Github
V2C	9k	27k	natural (human action)	Video Captioning, video-text retrieval	Link	Github
VATEX	41.3k	826k	natural	Video Captioning, video-text retrieval	Link	Github
YouCook2	2k:15.4k	15.4k	Cooking	Video Captioning, video-text retrieval	Link	Github
Charades	10k:10k	27.8k	Indoor activity	Video Captioning, video-text retrieval, action recognition	Link	-
MSR-VTT	7k:10k	200k	natural	Video Captioning, video-text retrieval	Link	Github
MSVD	2k:2k	70k	natural	Video Captioning, video-text retrieval	-	-
HowTo100M	1.2M: 136M	136M	instructional video	Video Captioning, video-text retrieval, action localization	Link	-
TGIF	102k: 102k	126k	animated GIFs	Video Captioning, video-text retrieval	Link	Github
TACoS-MLevel	185:25k	75k	Cooking	Video Captioning, video-text retrieval	Link	-
CrossTask	4.7K:-	4.7k	instructional	Temporal action localization	Link	Github
MiningYoutube	20k:200k	200k	Cooking	Temporal action localization	Link	Github
COIN	11,827:-	46,354	12 different domains	Action Segmentation	Link	Github
Breakfast	-:11267	11267	cooking	Action Segmentation	Link	-
LSMDC	200:128k	128k	Movie	Video Captioning, video-text retrieval	Link	-
HOMAGE	1.75K		Indoor activity	Activity Classification	Link	Github

Video-Audio Datasets

Name	# Video-audio	# utterance	Domain	Task	Access	Github
SoundNet	2M		natural	Audio-visual correspondence	Link	Github
MUSIC	714		music instruments	Audio-visual correspondence	Link	Github
AVSpeech	290k		Person	Audio-visual correspondence	Link	Github
URMP	44		music instruments	Audio-visual correspondence	Link	-
AV-Bench	v1 ~5k, v2 ~7k		natural	Audio-Visual Correspondence (AVC), Audio-Visual Event Localization (AVEL) and video parsing (AVVP), Sound Source Localization (SSL), etc.	Link	Github
AVE	4143		natural	temporal localization	Link	Github
360° video	1146		camera	Spatial Audio generation	Link	Github
openpose			person	Audio-visual correspondence, music-to-video generation	Link	Github
LRS2	-	144481	person	speech recognition, lip reading	Link	Github
LRS3	9506	151819	person	speech recognition, lip reading	Link	-

Point Cloud Datasets

Name	# mesh	Domain	Task	Access	Github
ModelNet40	12,311	CAD models	Classification, reconstruction	Link	Github
ShapeNet	220,000	3D models	Classification, reconstruction	Link	-
ScanObjectNN	2902	real-world point cloud	Classification, reconstruction	Link	Github

Image-LiDAR Datasets

Name	# images	# points (M)	Domain	Task	Access	Github
Eigen split KITTI	7481+7518	1799	auto driving	detection	Link	-
nuScenes			auto driving	3D detection and tracking	Link	Github
SemanticKITTI	23201+20351	4549	auto driving	segmentation	Link	Github

Contribute

PR welcome using the following markdown format:

- Paper Name. 
  - *Conference Year*. [[paper]](link) [[code]](link)
