XinyuSun/awesome-self-supervised-representation-learning

awesome video representation learning

MIT

Awesome Self-supervised Representation Learning

A curated list of papers, datasets, and benchmark results on self-supervised representation learning.

Contents

Self-supervised Video Representation Learning
- Papers
  - 2021 - 2020 - 2019
- Datasets
- Benchmark Results
Self-supervised Visual Representation Learning
- Papers
  - 2021 - 2020
- Datasets
- Benchmark Results
Self-supervised Multi-modal Representation Learning
- Papers
  - 2021 - 2020
- Datasets
- Benchmark Results

Self-supervised Video Representation Learning

Papers

2021

[STS] Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics - Jiangliu Wang et al, TPAMI 2021
[VideoMoCo] VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples - Tian Pan et al, CVPR 2021
[BE] Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning - Jinpeng Wang et al, CVPR 2021
[RSPNet] RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning - Peihao Chen et al, AAAI 2021
[DSM] Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion - Jinpeng Wang et al, AAAI 2021
[TCLR] TCLR: Temporal Contrastive Learning for Video Representation - Ishan Dave et al, arXiv 2021

2020

[CCL] Cycle-Contrast for Self-Supervised Video Representation Learning - Quan Kong et al, NeurIPS 2020
[CoCLR] Self-supervised Co-training for Video Representation Learning - Tengda Han et al, NeurIPS 2020
[PRP] Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning - Yuan Yao et al, CVPR 2020
[VRE-MRA] Exploiting Motion Information from Unlabeled Videos for Static Image Action Recognition - Yiyi Zhang et al, AAAI 2020
[VCP] Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning - Dezhao Luo et al, AAAI 2020
[IIC] Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework - Li Tao et al, ACMMM 2020
[RTT] Video Representation Learning by Recognizing Temporal Transformations - Simon Jenni et al, ECCV 2020
[VPP] Self-Supervised Video Representation Learning by Pace Prediction - Jiangliu Wang et al, ECCV 2020
[DTG-Net] DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition - Ziming Liu et al, arXiv 2020
[VTDL] Self-supervised Temporal Discriminative Learning for Video Representation Learning - Jinpeng Wang et al, arXiv 2020
[CVRL] Spatiotemporal Contrastive Video Representation Learning - Rui Qian et al, arXiv 2020
[PCL] Self-Supervised Video Representation Using Pretext-Contrastive Learning - Li Tao et al, arXiv 2020
[TCE] Temporally Coherent Embeddings for Self-Supervised Video Representation Learning - Joshua Knights et al, arXiv 2020
[HDC] Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning - Zehua Zhang et al, arXiv 2020
[CEP] Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning - Xinyu Yang et al, arXiv 2020
[TaCo] Can Temporal Information Help with Contrastive Self-Supervised Learning? - Yutong Bai et al, arXiv 2020

2019

[DPC] Video Representation Learning by Dense Predictive Coding - Tengda Han et al, ICCV Workshops 2019

Others

Space-Time Correspondence as a Contrastive Random Walk - Allan Jabri et al, NeurIPS 2020
Exploiting Temporal Coherence for Self-Supervised One-shot Video Re-identification - Dripta S. Raychaudhuri et al, ECCV 2020
Self-Supervision by Prediction for Object Discovery in Videos - Beril Besbinar et al, arXiv 2021

Datasets

UCF101
HMDB51

Benchmark Results

UCF101 & HMDB51

Method	Conference	Network	Input size	Pretraining Dataset	UCF101 Acc. (%)	HMDB51 Acc. (%)
[TCLR]	arXiv 2021	R(2+1)D	112x112	Kinetics-400	84.3	54.2
		R3D	112x112	Kinetics-400	84.1	53.6
[DSM]	AAAI 2021	R3D-34	224x224	Kinetics-400	78.2	52.8
		I3D	224x224	Kinetics-400	74.8	52.5
[RSPNet]	AAAI 2021	S3D-G	112x112	Kinetics-400	93.7	64.7
		R(2+1)D	112x112	Kinetics-400	81.1	44.6
		C3D	112x112	Kinetics-400	76.7	44.6
		R3D	112x112	Kinetics-400	74.3	41.8
[BE]	CVPR 2021	R3D	224x224	Kinetics-400	87.1	56.2
		I3D	224x224	Kinetics-400	86.8	55.4
[VideoMoCo]	CVPR 2021	R(2+1)D	112x112	Kinetics-400	78.7	49.2
		R3D	112x112	Kinetics-400	74.1	43.6
[STS]	TPAMI 2021	S3D-G	224x224	Kinetics-400	89.0	62.0
[CEP]	arXiv 2020	R(2+1)D	224x224	Kinetics-400	76.3	36.8
		SlowFast	128x128	Kinetics-400	68.5	36.8
[HDC]	arXiv 2020	R(2+1)D	112x112	Kinetics-400	76.2	39.8
		C3D	112x112	Kinetics-400	72.3	39.3
		R3D	112x112	Kinetics-400	68.5	38.1
[TCE]	arXiv 2020	R2D-50	224x224	Kinetics-400	71.2	36.6
		R2D-18	224x224	Kinetics-400	68.8	34.2
[PCL]	arXiv 2020	R3D	112x112	Kinetics-400	82.3	43.2
		R(2+1)D	112x112	Kinetics-400	80.7	44.6
		C3D	112x112	Kinetics-400	79.7	42.3
		R3D	112x112	Kinetics-400	79.5	41.7
[CVRL]	arXiv 2020	R3D-101	224x224	Kinetics-600	93.6	69.4
		R3D-101	224x224	Kinetics-400	92.9	66.7
[VTDL]	arXiv 2020	R(2+1)D	224x224	Kinetics-400	84.9	52.5
		I3D	224x224	Kinetics-400	82.1	52.9
		R3D	224x224	Kinetics-400	78.4	49.1
[DTG-Net]	arXiv 2020	TSN-ResNet18	-	Kinetics-400	69.1	-
[VPP]	ECCV 2020	R(2+1)D	224x224	Kinetics-400	77.1	36.6
[RTT]	ECCV 2020	R3D	112x112	Kinetics-400	79.3	49.8
		C3D	112x112	Kinetics-400	69.9	39.6

Self-supervised Visual Representation Learning

Papers

2021

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images - Li Sun et al, AAAI 2021

2020

CompRess: Self-Supervised Learning by Compressing Representations - Soroush Abbasi Koohpayegani et al, NeurIPS 2020
Self-Supervised Visual Representation Learning from Hierarchical Grouping - Xiao Zhang et al, NeurIPS 2020
[BYOL] Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning - Jean-Bastien Grill et al, NeurIPS 2020

Self-supervised Multi-modal Representation Learning

Papers

2021

Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning - Jingran Zhang et al, AAAI 2021

2020

Self-Supervised Learning by Cross-Modal Audio-Video Clustering - Humam Alwassel et al, NeurIPS 2020
Self-Supervised MultiModal Versatile Networks - Jean-Baptiste Alayrac et al, NeurIPS 2020
Labelling unlabelled videos from scratch with multi-modal self-supervision - Yuki Asano et al, NeurIPS 2020
[AVSA] Learning Representations from Audio-Visual Spatial Alignment - Pedro Morgado et al, NeurIPS 2020
[ELO] Evolving Losses for Unsupervised Video Representation Learning - AJ Piergiovanni et al, CVPR 2020
Audio-Visual Instance Discrimination with Cross-Modal Agreement - Pedro Morgado et al, arXiv 2020