Self-Supervised Pretraining

This is a paper list of self-supervised pretraining methods. All papers are listed in the order of their first appearance on arXiv.

In addition, papers are also categorized by topic. You can click the links below to find the papers on the topics you are interested in.
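
For quick reference, short unofficial code sketches of the recurring pretraining paradigms (contrastive learning, masked image modeling, momentum self-distillation, latent prediction, and autoregressive modeling) are placed at the end of the corresponding year lists below.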

All Papers

2020

  • [MoCov1] 🌟 Momentum Contrast for Unsupervised Visual Representation Learning | [CVPR'20] | [paper] [code]

    MoCov1 Arch

  • [SimCLRv1] 🌟 A Simple Framework for Contrastive Learning of Visual Representations | [ICML'20] | [paper] [code]

    SimCLRv1 Arch

  • [MoCov2] Improved Baselines with Momentum Contrastive Learning | [arxiv'20] | [paper] [code]

    MoCov2 Arch

  • [BYOL] Bootstrap your own latent: A new approach to self-supervised Learning | [NIPS'20] | [paper] [code]

    BYOL Arch

  • [SimCLRv2] Big Self-Supervised Models are Strong Semi-Supervised Learners | [NIPS'20] | [paper] [code]

    SimCLRv2 Arch

  • [SwAV] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | [NIPS'20] | [paper] [code]

    SwAV Arch

  • [RELICv1] Representation Learning via Invariant Causal Mechanisms | [ICLR'21] | [paper]

  • [CompRess] CompRess: Self-Supervised Learning by Compressing Representations | [NIPS'20] | [paper] [code]

    CompRess Arch

  • [DenseCL] Dense Contrastive Learning for Self-Supervised Visual Pre-Training | [CVPR'21] | [paper] [code]

    DenseCL Arch

  • [SimSiam] 🌟 Exploring Simple Siamese Representation Learning | [CVPR'21] | [paper] [code]

    SimSiam Arch
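
Many of the 2020 entries above (MoCo, SimCLR, SwAV) build on an instance-discrimination objective. Below is a minimal, unofficial sketch of the shared NT-Xent (InfoNCE) loss in PyTorch; the function name `nt_xent_loss`, the temperature value, and the tensor shapes are illustrative and are not taken from any of the linked repositories.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent (InfoNCE) loss over two augmented views of the same batch.

    z1, z2: (N, D) projections of the two views; pair (i, i) is positive,
    every other sample in the batch acts as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D)
    sim = z @ z.t() / temperature                             # (2N, 2N) scaled cosine similarity
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))           # exclude self-similarity
    # Row i's positive is the same image seen through the other view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage with random features standing in for projector(encoder(augment(x))):
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```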

2021

  • [SEED] SEED: Self-supervised Distillation For Visual Representation | [ICLR'21] | [paper] [code]

    SEED Arch

  • [ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | [ICML'21] | [paper]

    ALIGN Arch

  • [CLIP] 🌟 Learning Transferable Visual Models From Natural Language Supervision | [ICML'21] | [paper] [code]

    CLIP Arch

  • [Barlow Twins] Barlow Twins: Self-Supervised Learning via Redundancy Reduction | [ICML'21] | [paper] [code]

    Barlow Twins Arch

  • [S3L] Rethinking Self-Supervised Learning: Small is Beautiful | [arxiv'21] | [paper] [code]

  • [MoCov3] 🌟 An Empirical Study of Training Self-Supervised Vision Transformers | [ICCV'21] | [paper] [code]

  • [DisCo] DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning | [ECCV'22] | [paper] [code]

    DisCo Arch

  • [DoGo] Distill on the Go: Online knowledge distillation in self-supervised learning | [CVPRW'21] | [paper] [code]

    DoGo Arch

  • [DINOv1] 🌟 Emerging Properties in Self-Supervised Vision Transformers | [ICCV'21] | [paper] [code]

    DINOv1 Arch

  • [VICReg] VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning | [ICLR'22] | [paper] [code]

    VICReg Arch

  • [MST] MST: Masked Self-Supervised Transformer for Visual Representation | [NIPS'21] | [paper]

    MST Arch

  • [BEiTv1] 🌟 BEiT: BERT Pre-Training of Image Transformers | [ICLR'22] | [paper] [code]

    BEiTv1 Arch

  • [SimDis] Simple Distillation Baselines for Improving Small Self-supervised Models | [ICCVW'21] | [paper] [code]

    SimDis Arch

  • [OSS] Unsupervised Representation Transfer for Small Networks: I Believe I Can Distill On-the-Fly | [NIPS'21] | [paper]

    OSS Arch

  • [BINGO] Bag of Instances Aggregation Boosts Self-supervised Distillation | [ICLR'22] | [paper] [code]

    BINGO Arch

  • [SSL-Small] On the Efficacy of Small Self-Supervised Contrastive Models without Distillation Signals | [AAAI'22] | [paper] [code]

  • [C-BYOL/C-SimCLR] Compressive Visual Representations | [NIPS'21] | [paper] [code]

  • [MAE] 🌟 Masked Autoencoders Are Scalable Vision Learners | [CVPR'22] | [paper] [code]

    MAE Arch

  • [iBOT] iBOT: Image BERT Pre-Training with Online Tokenizer | [ICLR'22] | [paper] [code]

    iBOT Arch

  • [SimMIM] 🌟 SimMIM: A Simple Framework for Masked Image Modeling | [CVPR'22] | [paper] [code]

    SimMIM Arch

  • [PeCo] PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | [AAAI'23] | [paper]

    PeCo Arch

  • [MaskFeat] Masked Feature Prediction for Self-Supervised Visual Pre-Training | [CVPR'22] | [paper] [code]

    MaskFeat Arch
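
Several 2021 entries (BEiT, MAE, SimMIM, MaskFeat) introduced masked image modeling. The unofficial sketch below shows the generic recipe (patchify, randomly drop patches, reconstruct the hidden pixels); the function names, masking ratio, patch size, and the `encoder`/`decoder` callables are placeholders rather than any paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def patchify(imgs, patch=16):
    """(B, 3, H, W) -> (B, L, patch*patch*3), with L = (H // patch) * (W // patch)."""
    b, c, h, w = imgs.shape
    x = imgs.reshape(b, c, h // patch, patch, w // patch, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(b, -1, patch * patch * c)

def mim_step(imgs, encoder, decoder, mask_ratio=0.75, patch=16):
    """Generic masked-image-modeling step: encode visible patches, regress the hidden pixels."""
    patches = patchify(imgs, patch)                             # (B, L, P)
    b, L, p_dim = patches.shape
    num_keep = int(L * (1 - mask_ratio))
    ids = torch.rand(b, L, device=imgs.device).argsort(dim=1)   # random patch permutation per image
    keep_ids, masked_ids = ids[:, :num_keep], ids[:, num_keep:]
    visible = torch.gather(patches, 1, keep_ids.unsqueeze(-1).repeat(1, 1, p_dim))
    latent = encoder(visible, keep_ids)                         # placeholder: encode visible patches
    pred = decoder(latent, masked_ids)                          # placeholder: predict masked patches
    target = torch.gather(patches, 1, masked_ids.unsqueeze(-1).repeat(1, 1, p_dim))
    return F.mse_loss(pred, target)                             # loss on masked patches only
```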

2022

  • [RELICv2] Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | [arxiv'22] | [paper]

    RELICv2 Arch

  • [SimReg] SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation | [BMVC'21] | [paper] [code]

    SimReg Arch

  • [RePre] RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training | [arxiv'22] | [paper]

    RePre Arch

  • [CAEv1] Context Autoencoder for Self-Supervised Representation Learning | [arxiv'22] | [paper] [code]

    CAEv1 Arch

  • [CIM] Corrupted Image Modeling for Self-Supervised Visual Pre-Training | [ICLR'23] | [paper]

    CIM Arch

  • [MVP] MVP: Multimodality-guided Visual Pre-training | [ECCV'22] | [paper]

    MVP Arch

  • [ConvMAE] ConvMAE: Masked Convolution Meets Masked Autoencoders | [NIPS'22] | [paper] [code]

    ConvMAE Arch

  • [ConMIM] Masked Image Modeling with Denoising Contrast | [ICLR'23] | [paper] [code]

    ConMIM Arch

  • [MixMAE] MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | [CVPR'23] | [paper] [code]

    MixMAE Arch

  • [A2MIM] Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | [ICML'23] | [paper] [code]

    A2MIM Arch

  • [FD] Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | [arxiv'22] | [paper] [code]

    FD Arch

  • [ObjMAE] Object-wise Masked Autoencoders for Fast Pre-training | [arxiv'22] | [paper]

    ObjMAE Arch

  • [MAE-Lite] A Closer Look at Self-Supervised Lightweight Vision Transformers | [ICML'23] | [paper] [code]

    MAE-Lite Arch

  • [SupMAE] SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners | [arxiv'22] | [paper] [code]

    SupMAE Arch

  • [HiViT] HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling | [ICLR'23] | [paper] [mmpretrain code]

    HiViT Arch

  • [LoMaR] Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction | [arxiv'22] | [paper] [code]

    LoMaR Arch

  • [SIM] Siamese Image Modeling for Self-Supervised Vision Representation Learning | [CVPR'23] | [paper] [code]

    SIM Arch

  • [MFM] Masked Frequency Modeling for Self-Supervised Visual Pre-Training | [ICLR'23] | [paper] [code]

    MFM Arch

  • [BootMAE] Bootstrapped Masked Autoencoders for Vision BERT Pretraining | [ECCV'22] | [paper] [code]

    BootMAE Arch

  • [CMAE] Contrastive Masked Autoencoders are Stronger Vision Learners | [arxiv'22] | [paper] [code]

    CMAE Arch

  • [SMD] Improving Self-supervised Lightweight Model Learning via Hard-aware Metric Distillation | [ECCV'22] | [paper] [code]

    SMD Arch

  • [SdAE] SdAE: Self-distillated Masked Autoencoder | [ECCV'22] | [paper] [code]

    SdAE Arch

  • [MILAN] MILAN: Masked Image Pretraining on Language Assisted Representation | [arxiv'22] | [paper] [code]

    MILAN Arch

  • [BEiTv2] BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers | [arxiv'22] | [paper] [code]

    BEiTv2 Arch

  • [BEiTv3] Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | [CVPR'23] | [paper] [code]

    BEiTv3 Arch

  • [MaskCLIP] MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining | [CVPR'23] | [paper] [code]

    MaskCLIP Arch

  • [MimCo] MimCo: Masked Image Modeling Pre-training with Contrastive Teacher | [arxiv'22] | [paper]

    MimCo Arch

  • [VICRegL] VICRegL: Self-Supervised Learning of Local Visual Features | [NIPS'22] | [paper] [code]

    VICRegL Arch

  • [SSLight] Effective Self-supervised Pre-training on Low-compute Networks without Distillation | [ICLR'23] | [paper] [code]

  • [U-MAE] How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders | [NIPS'22] | [paper] [code]

  • [i-MAE] i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? | [arxiv'22] | [paper] [code]

    i-MAE Arch

  • [CAN] A simple, efficient and scalable contrastive masked autoencoder for learning visual representations | [arxiv'22] | [paper] [code]

    CAN Arch

  • [EVA] EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | [CVPR'23] | [paper] [code]

    EVA Arch

  • [CAEv2] CAE v2: Context Autoencoder with CLIP Target | [arxiv'22] | [paper]

    CAEv2 Arch

  • [iTPN] Integrally Pre-Trained Transformer Pyramid Networks | [CVPR'23] | [paper] [code]

    iTPN Arch

  • [SCFS] Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning | [ICCV'23] | [paper] [code]

    SCFS Arch

  • [FastMIM] FastMIM: Expediting Masked Image Modeling Pre-training for Vision | [arxiv'22] | [paper] [code]

    FastMIM Arch

  • [Light-MoCo] Establishing a stronger baseline for lightweight contrastive models | [ICME'23] | [paper] [code] [ICLR'23 under-review version]

    Light-MoCo Arch

  • [Scale-MAE] Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning | [ICCV'23] | [paper]

    Scale-MAE Arch
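
A recurring ingredient in the 2022 entries above (and in the earlier BYOL, MoCo, and DINO line) is a momentum teacher that is updated as an exponential moving average (EMA) of the student and provides distillation or reconstruction targets. A minimal, unofficial sketch of that update; the function name `ema_update` and the momentum value 0.996 are illustrative.

```python
import copy
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.996):
    """teacher = m * teacher + (1 - m) * student, parameter by parameter."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1 - m)

# Typical setup: the teacher starts as a copy of the student and is never trained
# by backprop; only the EMA update above changes its weights.
student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
ema_update(student, teacher)
```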

2023

  • [ConvNeXtv2] ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | [CVPR'23] | [paper] [code]

    ConvNeXtv2 Arch

  • [SparK] Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | [ICLR'23] | [paper] [code]

    SparK Arch

  • [I-JEPA] Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture | [CVPR'23] | [paper] [code]

    I-JEPA Arch

  • [RoB] A Simple Recipe for Competitive Low-compute Self-supervised Vision Models | [arxiv'23] | [paper]

    RoB Arch

  • [Layer Grafted] Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations | [ICLR'23] | [paper] [code]

    Layer Grafted Arch

  • [G2SD] Generic-to-Specific Distillation of Masked Autoencoders | [CVPR'23] | [paper] [code]

    G2SD Arch

  • [PixMIM] PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling | [arxiv'23] | [paper] [code]

    PixMIM Arch

  • [LocalMIM] Masked Image Modeling with Local Multi-Scale Reconstruction | [CVPR'23] | [paper] [code]

    LocalMIM Arch

  • [MR-MAE] Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking | [arxiv'23] | [paper]

    MR-MAE Arch

  • [Overcoming-Pretraining-Bias] Overwriting Pretrained Bias with Finetuning Data | [ICCV'23] | [paper] [code]

    Overcoming-Pretraining-Bias Arch

  • [MixedAE] Mixed Autoencoder for Self-supervised Visual Representation Learning | [CVPR'23] | [paper]

    MixedAE Arch

  • [EMP] EMP-SSL: Towards Self-Supervised Learning in One Training Epoch | [arxiv'23] | [paper] [code]

    EMP Arch

  • [DINOv2] DINOv2: Learning Robust Visual Features without Supervision | [arxiv'23] | [paper] [code]

  • [CL-vs-MIM] What Do Self-Supervised Vision Transformers Learn? | [ICLR'23] | [paper] [code]

  • [SiamMAE] Siamese Masked Autoencoders | [NIPS'23] | [paper]

    SiamMAE Arch

  • [ccMIM] Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining | [ICLR'23] | [paper] [code]

    ccMIM Arch

  • [DreamTeacher] DreamTeacher: Pretraining Image Backbones with Deep Generative Models | [ICCV'23] | [paper]

    DreamTeacher Arch

  • [MFF] Improving Pixel-based MIM by Reducing Wasted Modeling Capability | [ICCV'23] | [paper] [code]

    MFF Arch

  • [DropPos] DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions | [NIPS'23] | [paper] [code]

    DropPos Arch

  • [Registers] Vision Transformers Need Registers | [arxiv'23] | [paper] [code]

    Registers Arch

  • [D-iGPT] Rejuvenating image-GPT as Strong Visual Representation Learners | [arxiv'23] | [paper] [code]

    D-iGPT Arch

  • [SynCLR] Learning Vision from Models Rivals Learning Vision from Data | [arxiv'23] | [paper] [code]
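
Some 2023 entries, most prominently I-JEPA, predict latent features of masked regions instead of raw pixels. A heavily simplified, unofficial sketch of that idea follows; `context_encoder`, `target_encoder`, and `predictor` are placeholder modules, and the exact regression loss varies between papers.

```python
import torch
import torch.nn.functional as F

def latent_prediction_loss(tokens, context_encoder, target_encoder, predictor,
                           context_ids, target_ids):
    """Predict teacher features of masked patches from context patches only.

    tokens:      (B, L, D) patch embeddings of one image.
    context_ids: 1-D tensor of visible (context) patch indices.
    target_ids:  1-D tensor of patch indices whose features must be predicted.
    """
    with torch.no_grad():                               # the target encoder provides fixed targets
        targets = target_encoder(tokens)[:, target_ids] # teacher features of the target patches
    context = context_encoder(tokens[:, context_ids])   # the student sees context patches only
    preds = predictor(context, target_ids)              # placeholder: predict the target features
    return F.mse_loss(preds, targets)                   # L2 regression (papers vary in the loss)
```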

2024

  • [AIM] Scalable Pre-training of Large Autoregressive Image Models | [arxiv'24] | [paper] [code]

    AIM Arch

  • [CrossMAE] Rethinking Patch Dependence for Masked Autoencoders | [arxiv'24] | [paper] [code]

    CrossMAE Arch

  • [Cross-Scale MAE] Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing | [NIPS'23] | [paper] [code]

    Cross-Scale MAE Arch

  • [MIM-Refiner] MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | [arxiv'24] | [paper] [code]

    MIM-Refiner Arch
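
The AIM entry above revisits GPT-style autoregressive pretraining for images: patches are embedded, processed with causal attention, and each position regresses the raw pixels of the next patch. The sketch below is a compact, unofficial illustration; the class name, depth, width, and patch size are illustrative and do not correspond to AIM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAutoregressiveImageModel(nn.Module):
    """Causal transformer over patch tokens; each position predicts the next patch's pixels."""

    def __init__(self, patch=16, dim=256, depth=4, heads=4, num_patches=196):
        super().__init__()
        self.patch_dim = patch * patch * 3
        self.embed = nn.Linear(self.patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, self.patch_dim)            # pixel-regression head

    def forward(self, patches):                               # patches: (B, L, patch*patch*3)
        L = patches.size(1)
        causal = torch.full((L, L), float('-inf'), device=patches.device).triu(1)
        x = self.embed(patches) + self.pos[:, :L]
        x = self.blocks(x, mask=causal)                       # attend to previous patches only
        return self.head(x)

# Next-patch objective in raster order: position i regresses the pixels of patch i + 1.
model = TinyAutoregressiveImageModel()
patches = torch.randn(2, 196, 16 * 16 * 3)
pred = model(patches)
loss = F.mse_loss(pred[:, :-1], patches[:, 1:])
```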