A curated list of papers on self-supervised representation learning.
- Self-supervised Video Representation Learning
- Self-supervised Visual Representation Learning
- Self-supervised Multi-modal Representation Learning
- [STS] Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics - Jiangliu Wang et al,
TPAMI 2021
- [VideoMoCo] VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples - Tian Pan et al,
CVPR 2021
- [BE] Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning - Jinpeng Wang et al,
CVPR 2021
- [RSPNet] RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning - Peihao Chen et al,
AAAI 2021
- [DSM] Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion - Jinpeng Wang et al,
AAAI 2021
- [TCLR] TCLR: Temporal Contrastive Learning for Video Representation - Ishan Dave et al,
Arxiv 2021
- [CCL] Cycle-Contrast for Self-Supervised Video Representation Learning - Quan Kong et al,
NeurIPS 2020
- [CoCLR] Self-supervised Co-training for Video Representation Learning - Tengda Han et al,
NeurIPS 2020
- [PRP] Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning - Yuan Yao et al,
CVPR 2020
- [VRE-MRA] Exploiting Motion Information from Unlabeled Videos for Static Image Action Recognition - Yiyi Zhang et al,
AAAI 2020
- [VCP] Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning - Dezhao Luo et al,
AAAI 2020
- [IIC] Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework - Li Tao et al,
ACMMM 2020
- [RTT] Video Representation Learning by Recognizing Temporal Transformations - Simon Jenni et al,
ECCV 2020
- [VPP] Self-Supervised Video Representation Learning by Pace Prediction - Jiangliu Wang et al,
ECCV 2020
- [DTG-Net] DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition - Ziming Liu et al,
Arxiv 2020
- [VTDL] Self-supervised Temporal Discriminative Learning for Video Representation Learning - Jinpeng Wang et al,
Arxiv 2020
- [CVRL] Spatiotemporal Contrastive Video Representation Learning - Rui Qian et al,
Arxiv 2020
- [PCL] Self-Supervised Video Representation Using Pretext-Contrastive Learning - Li Tao et al,
Arxiv 2020
- [TCE] Temporally Coherent Embeddings for Self-Supervised Video Representation Learning - Joshua Knights et al,
Arxiv 2020
- [HDC] Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning - Zehua Zhang et al,
Arxiv 2020
- [CEP] Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning - Xinyu Yang et al,
Arxiv 2020
- [TaCo] Can Temporal Information Help with Contrastive Self-Supervised Learning? - Yutong Bai et al,
Arxiv 2020
- [DPC] Video Representation Learning by Dense Predictive Coding - Tengda Han et al,
ICCV Workshops 2019
- [] Space-Time Correspondence as a Contrastive Random Walk - Allan Jabri et al,
NeurIPS 2020
- [] Exploiting Temporal Coherence for Self-Supervised One-shot Video Re-identification - Dripta S. Raychaudhuri et al,
ECCV 2020
- [] Self-Supervision by Prediction for Object Discovery in Videos - Beril Besbinar et al,
Arxiv 2021
- UCF101
- HMDB51

| Method | Venue | Network | Input size | Pretrain Dataset | UCF101 Acc. (%) | HMDB51 Acc. (%) |
|---|---|---|---|---|---|---|
| [TCLR] | Arxiv 2021 | R(2+1)D | 112x112 | Kinetics-400 | 84.3 | 54.2 |
| | | R3D | 112x112 | Kinetics-400 | 84.1 | 53.6 |
| [DSM] | AAAI 2021 | R3D-34 | 224x224 | Kinetics-400 | 78.2 | 52.8 |
| | | I3D | 224x224 | Kinetics-400 | 74.8 | 52.5 |
| [RSPNet] | AAAI 2021 | S3D-G | 112x112 | Kinetics-400 | 93.7 | 64.7 |
| | | R(2+1)D | 112x112 | Kinetics-400 | 81.1 | 44.6 |
| | | C3D | 112x112 | Kinetics-400 | 76.7 | 44.6 |
| | | R3D | 112x112 | Kinetics-400 | 74.3 | 41.8 |
| [BE] | CVPR 2021 | R3D | 224x224 | Kinetics-400 | 87.1 | 56.2 |
| | | I3D | 224x224 | Kinetics-400 | 86.8 | 55.4 |
| [VideoMoCo] | CVPR 2021 | R(2+1)D | 112x112 | Kinetics-400 | 78.7 | 49.2 |
| | | R3D | 112x112 | Kinetics-400 | 74.1 | 43.6 |
| [STS] | TPAMI 2021 | S3D-G | 224x224 | Kinetics-400 | 89.0 | 62.0 |
| [CEP] | Arxiv 2020 | R(2+1)D | 224x224 | Kinetics-400 | 76.3 | 36.8 |
| | | SlowFast | 128x128 | Kinetics-400 | 68.5 | 36.8 |
| [HDC] | Arxiv 2020 | R(2+1)D | 112x112 | Kinetics-400 | 76.2 | 39.8 |
| | | C3D | 112x112 | Kinetics-400 | 72.3 | 39.3 |
| | | R3D | 112x112 | Kinetics-400 | 68.5 | 38.1 |
| [TCE] | Arxiv 2020 | R2D-50 | 224x224 | Kinetics-400 | 71.2 | 36.6 |
| | | R2D-18 | 224x224 | Kinetics-400 | 68.8 | 34.2 |
| [PCL] | Arxiv 2020 | R3D | 112x112 | Kinetics-400 | 82.3 | 43.2 |
| | | R(2+1)D | 112x112 | Kinetics-400 | 80.7 | 44.6 |
| | | C3D | 112x112 | Kinetics-400 | 79.7 | 42.3 |
| | | R3D | 112x112 | Kinetics-400 | 79.5 | 41.7 |
| [CVRL] | Arxiv 2020 | R3D-101 | 224x224 | Kinetics-600 | 93.6 | 69.4 |
| | | R3D-101 | 224x224 | Kinetics-400 | 92.9 | 66.7 |
| [VTDL] | Arxiv 2020 | R(2+1)D | 224x224 | Kinetics-400 | 84.9 | 52.5 |
| | | I3D | 224x224 | Kinetics-400 | 82.1 | 52.9 |
| | | R3D | 224x224 | Kinetics-400 | 78.4 | 49.1 |
| [DTG-Net] | Arxiv 2020 | TSN-ResNet18 | - | Kinetics-400 | 69.1 | - |
| [VPP] | Arxiv 2020 | R(2+1)D | 224x224 | Kinetics-400 | 77.1 | 36.6 |
| [RTT] | ECCV 2020 | R3D | 112x112 | Kinetics-400 | 79.3 | 49.8 |
| | | C3D | 112x112 | Kinetics-400 | 69.9 | 39.6 |

- [] Context Matters: Graph-based Self-supervised Representation Learning for Medical Images - Li Sun et al,
AAAI 2021
- [] CompRess: Self-Supervised Learning by Compressing Representations - Soroush Abbasi Koohpayegani et al,
NeurIPS 2020
- [] Self-Supervised Visual Representation Learning from Hierarchical Grouping - Xiao Zhang et al,
NeurIPS 2020
- [BYOL] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning - Jean-Bastien Grill et al,
NeurIPS 2020
- Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning - Jingran Zhang et al,
AAAI 2021
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering - Humam Alwassel et al,
NeurIPS 2020
- Self-Supervised MultiModal Versatile Networks - Jean-Baptiste Alayrac et al,
NeurIPS 2020
- Labelling unlabelled videos from scratch with multi-modal self-supervision - Yuki Asano et al,
NeurIPS 2020
- [AVSA] Learning Representations from Audio-Visual Spatial Alignment - Pedro Morgado et al,
NeurIPS 2020
- [ELO] Evolving Losses for Unsupervised Video Representation Learning - AJ Piergiovanni et al,
CVPR 2020
- Audio-Visual Instance Discrimination with Cross-Modal Agreement - Pedro Morgado et al,
Arxiv 2020