A Survey on Self-Supervised Learning

This repository provides a brief summary of algorithms from our review paper A Survey of Self-Supervised Learning from Multiple Perspectives: Algorithms, Theory, Applications and Future Trends.

Self-supervised learning (SSL) has achieved major breakthroughs in computer vision (CV) in recent years. This work therefore focuses mainly on SSL research from the CV community, especially classic and influential results. The review aims to explain what SSL is, its categories and subcategories, how it differs from and relates to other machine learning paradigms, and its theoretical underpinnings. We present an up-to-date and comprehensive review of the frontiers of visual SSL and divide visual SSL into three categories: context-based, contrastive, and generative SSL, in the hope of helping researchers make sense of current trends.

See our paper for more details.

Algorithms

Context Based Methods

  • (Rotation): Unsupervised Representation Learning by Predicting Image Rotations. [paper] [code]

  • (Colorization): Colorful Image Colorization. [paper] [code]

  • (Jigsaw): Scaling and Benchmarking Self-Supervised Visual Representation Learning. [paper] [code]
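As a rough illustration of a context-based pretext task, the sketch below follows the rotation-prediction idea: each unlabeled image is rotated by 0, 90, 180, or 270 degrees, and the rotation index serves as a free label for a classifier. Images are plain nested lists here, and the helper names `rot90` and `make_rotation_batch` are hypothetical, not taken from the paper's code.

```python
from typing import List, Tuple

Image = List[List[int]]  # a single-channel image stored as rows of pixel values


def rot90(img: Image) -> Image:
    """Rotate an image 90 degrees counter-clockwise."""
    # zip(*img) yields the columns; reversing their order gives a CCW rotation.
    return [list(col) for col in zip(*img)][::-1]


def make_rotation_batch(img: Image) -> List[Tuple[Image, int]]:
    """Return (rotated image, label) pairs, where label k means k * 90 degrees."""
    pairs = []
    current = img
    for k in range(4):
        pairs.append((current, k))
        current = rot90(current)
    return pairs


batch = make_rotation_batch([[1, 2], [3, 4]])
for rotated, label in batch:
    print(label, rotated)
```

A network trained to recover the label from the rotated image has to pick up cues about object orientation and shape, which is the intuition behind this pretext task.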

Contrastive Learning

  • CL methods based on negative examples:

    • (MoCo v1): Momentum Contrast for Unsupervised Visual Representation Learning. [paper] [code]

    • (MoCo v2): Improved Baselines with Momentum Contrastive Learning. [paper] [code]

    • (MoCo v3): An Empirical Study of Training Self-Supervised Vision Transformers. [paper] [code]

    • (SimCLR v1): A Simple Framework for Contrastive Learning of Visual Representations. [paper] [code]

    • (SimCLR v2): Big Self-Supervised Models are Strong Semi-Supervised Learners. [paper] [code]

  • CL methods based on self-distillation:

    • (BYOL): Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. [paper] [code]

    • (SimSiam): Exploring Simple Siamese Representation Learning. [paper] [code]

  • CL methods based on feature decorrelation:

    • (Barlow Twins): Barlow Twins: Self-Supervised Learning via Redundancy Reduction. [paper] [code]

    • (VICReg): VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. [paper] [code]

  • Others:

    • methods that combine CL and masked image modeling (MIM)
    • ...
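For the negative-example family above, a minimal sketch of a SimCLR-style NT-Xent (normalized temperature-scaled cross-entropy) loss may help: the two augmented views of the same image are pulled together, while every other sample in the batch acts as a negative. This is a plain-Python illustration under simplifying assumptions (embeddings are given as lists, and the temperature of 0.5 is only an illustrative default); it is not the reference implementation. MoCo, by contrast, draws its negatives from a queue of keys produced by a momentum encoder.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(z1: List[Vector], z2: List[Vector], temperature: float = 0.5) -> float:
    """NT-Xent loss where z1[i] and z2[i] embed two augmented views of image i."""
    z = z1 + z2  # 2N embeddings: first views, then second views
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of sample i
        sims = [cosine(z[i], z[k]) / temperature for k in range(2 * n)]
        # Denominator sums over every sample in the batch except i itself.
        denom = sum(math.exp(s) for k, s in enumerate(sims) if k != i)
        total += -(sims[j] - math.log(denom))  # -log softmax of the positive
    return total / (2 * n)


aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)  # views that agree give the lower loss
```

Self-distillation methods such as BYOL and SimSiam drop the negatives in the denominator entirely, and feature-decorrelation methods such as Barlow Twins and VICReg replace this loss with penalties on the statistics of the embedding dimensions.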

Generative Algorithms

  • (BEiT): BEiT: BERT Pre-Training of Image Transformers. [paper] [code]

  • (MAE): Masked Autoencoders Are Scalable Vision Learners. [paper] [code]

  • (iBOT): iBOT: Image BERT Pre-Training with Online Tokenizer. [paper] [code]

  • (CAE): Context Autoencoder for Self-Supervised Representation Learning. [paper] [code]

  • (SimMIM): SimMIM: A Simple Framework for Masked Image Modeling. [paper] [code]
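To make the masked-image-modeling idea concrete, here is a small sketch of MAE-style random masking: the image is split into patches, a random subset is hidden, and only the visible patches are passed to the encoder, while the decoder reconstructs the rest. It sorts patches by random noise to obtain a random permutation; the function name `random_masking`, the `seed` argument, and the example figures (196 patches, as for a 224×224 image with 16×16 patches, and a 75% mask ratio, the default reported for MAE) are assumptions chosen for illustration. The other methods in this list differ in details such as the masking pattern and the prediction target.

```python
import random
from typing import List, Tuple


def random_masking(num_patches: int, mask_ratio: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) using a random permutation."""
    rng = random.Random(seed)
    noise = [rng.random() for _ in range(num_patches)]
    # Sorting patch indices by i.i.d. random noise yields a uniform random shuffle.
    order = sorted(range(num_patches), key=lambda i: noise[i])
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


visible, masked = random_masking(num_patches=196, mask_ratio=0.75)
print(len(visible), len(masked))
```

The visible indices would select the patch tokens fed to the encoder, and the masked indices mark where the reconstruction loss is computed (MAE computes its loss only on masked patches).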

Applications

4.1 Sequential data

Natural language processing (NLP)

  • (Skip-Gram): Distributed Representations of Words and Phrases and their Compositionality. [paper] [code]

  • (BERT): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [paper] [code]

  • (GPT): Improving Language Understanding by Generative Pre-Training. [paper]
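The self-supervision in Skip-Gram comes from raw text alone: each word is used to predict the words around it. The sketch below only builds the (center, context) training pairs with a fixed window; the word2vec training pipeline adds details such as negative sampling and subsampling of frequent words, which are omitted here, and the function name `skipgram_pairs` is hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


print(skipgram_pairs(["the", "cat", "sat", "down"], window=1))
```

BERT and GPT derive their labels from raw text in a similar spirit, but predict masked tokens and the next token, respectively, with Transformer encoders or decoders.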

Sequential models for image processing and computer vision

  • (CPC): Representation Learning with Contrastive Predictive Coding. [paper]

  • (Image GPT): Generative Pretraining from Pixels. [paper] [code]

4.2 Image processing and computer vision

Video

  • (MIL-NCE): End-to-End Learning of Visual Representations From Uncurated Instructional Videos. [paper] [code]

  • Unsupervised Learning of Visual Representations using Videos. [paper]

  • Unsupervised Learning of Video Representations using LSTMs. [paper] [code]

1. Temporal information in videos:

The order of the frames:

  • Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. [paper]

  • Self-Supervised Video Representation Learning With Odd-One-Out Networks. [paper]

Video playing direction:

  • Learning and Using the Arrow of Time. [paper]

Video playing speed:

  • (SpeedNet): SpeedNet: Learning the Speediness in Videos. [paper]

2. Motion of objects such as optical flow:

  • (DynamoNet): DynamoNet: Dynamic Action and Motion Network. [paper]

  • (CoCLR): Self-supervised Co-training for Video Representation Learning. [paper] [code]

3. Multimodal data, such as RGB frames, audio, and narrations:

  • Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. [paper]

  • Time-Contrastive Networks: Self-Supervised Learning from Video. [paper]

4. Spatio-temporal coherence of objects, such as colours and shapes:

  • Learning Correspondence from the Cycle-Consistency of Time. [paper]

  • (VCP): Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning. [paper]

  • Joint-task Self-supervised Learning for Temporal Correspondence. [paper] [code]
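The frame-order pretext tasks listed under temporal information (for example, Shuffle and Learn) turn temporal order into a free binary label. A simplified sketch, assuming frames are referred to by index: an in-order triplet is a positive, and a shuffled copy of the same frames is a negative. The function name and the rule that every non-original permutation counts as a negative are illustrative assumptions; the published methods choose their tuples and negatives more carefully.

```python
import itertools
import random
from typing import List, Tuple

Sample = Tuple[Tuple[int, ...], int]  # (frame indices, label: 1 = in order, 0 = shuffled)


def order_verification_samples(num_frames: int, rng: random.Random) -> List[Sample]:
    """Build one positive (in-order) and one negative (shuffled) frame triplet."""
    # Three distinct frame indices, kept in their natural temporal order.
    ordered = tuple(sorted(rng.sample(range(num_frames), 3)))
    # Any other permutation of the same three frames is treated as out of order here.
    shuffled_options = [p for p in itertools.permutations(ordered) if p != ordered]
    negative = rng.choice(shuffled_options)
    return [(ordered, 1), (negative, 0)]


print(order_verification_samples(num_frames=30, rng=random.Random(0)))
```

A classifier that sees only the frames (not the indices) must reason about motion and scene dynamics to tell the two cases apart, which is what makes the task useful for representation learning.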

Other fields

  • medical field: Preservational Learning Improves Self-supervised Medical Image Models by Reconstructing Diverse Contexts. [paper] [code]

  • medical image segmentation: Contrastive learning of global and local features for medical image segmentation with limited annotations. [paper] [code]

  • 3D medical image analysis: Rubik’s Cube+: A self-supervised feature learning framework for 3D medical image analysis. [paper]

Contact

If you have any suggestions or find our work helpful, feel free to contact us:

Email: {guijie,tchen}@seu.edu.cn