
The official GitHub page for the survey paper "Self-Supervised Learning for Videos: A Survey"


A collection of works on self-supervised deep learning for video. The papers listed here refer to our survey:

Self-Supervised Learning for Videos: A Survey

Madeline Chantry Schiappa, Yogesh Singh Rawat, Mubarak Shah


In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.

Overview of publications. Statistics of self-supervised learning (SSL) research on video representation learning in recent years. From left to right we show (a) the total number of SSL-related papers published in top conference venues, (b) a categorical breakdown of the main research topics studied in SSL, and (c) a breakdown of the main modalities used in SSL. The year 2022 remains incomplete because a majority of the conferences occur later in the year.

Overview of publications related to action recognition. Action recognition performance of models over time for different self-supervised strategies and modalities: video-only (V), video-text (V+T), video-audio (V+A), and video-text-audio (V+T+A). More recently, contrastive learning has become the most popular strategy.

Training Tasks

Pre-Text Learning

Action Recognition

Downstream evaluation of action recognition on pretext self-supervised learning measured by prediction accuracy. Top scores are in bold. Playback speed related tasks typically perform the best.

| Model | Subcategory | Visual Backbone | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Geometry | Appearance | AlexNet | UCF101/HMDB51 | 54.10 | 22.60 |
| Wang et al. | Appearance | C3D | UCF101 | 61.20 | 33.40 |
| 3D RotNet | Appearance | 3D R-18 | MT | 62.90 | 33.70 |
| VideoJigsaw | Jigsaw | CaffeNet | Kinetics | 54.70 | 27.00 |
| 3D ST-puzzle | Jigsaw | C3D | Kinetics | 65.80 | 33.70 |
| CSJ | Jigsaw | R(2+3)D | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
| PRP | Speed | R3D | Kinetics | 72.10 | 35.00 |
| SpeedNet | Speed | S3D-G | Kinetics | 81.10 | 48.80 |
| Jenni et al. | Speed | R(2+1)D | UCF101 | 87.10 | 49.80 |
| PacePred | Speed | S3D-G | UCF101 | 87.10 | 52.60 |
| ShuffleLearn | Temporal Order | AlexNet | UCF101 | 50.90 | 19.80 |
| OPN | Temporal Order | VGG-M | UCF101 | 59.80 | 23.80 |
| O3N | Temporal Order | AlexNet | UCF101 | 60.30 | 32.50 |
| ClipOrder | Temporal Order | R3D | UCF101 | 72.40 | 30.90 |
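
The speed-based pretext tasks in the table train a network to recognize how fast a clip is played. As a rough illustration only, the sketch below builds one such training sample by keeping every `speed`-th frame and using the index of the chosen speed as the label. The function name `sample_speed_clip`, the speed set `(1, 2, 4, 8)`, and the clip length of 16 are assumptions made for this example; they are not taken from SpeedNet, PRP, PacePred, or any other listed method.

```python
import random


def sample_speed_clip(num_frames, clip_len=16, speeds=(1, 2, 4, 8), rng=None):
    """Pick a playback speed and return (frame_indices, speed_label).

    The clip keeps every `speed`-th frame, so a larger stride looks like
    faster playback. The label is the index of the chosen speed in `speeds`.
    """
    rng = rng or random.Random()
    # Only speeds whose clip fits inside the video are eligible.
    valid = [s for s in speeds if (clip_len - 1) * s < num_frames]
    if not valid:
        raise ValueError("video too short for any of the requested speeds")
    speed = rng.choice(valid)
    # Number of source frames covered by the strided clip.
    span = (clip_len - 1) * speed + 1
    start = rng.randrange(num_frames - span + 1)
    indices = [start + i * speed for i in range(clip_len)]
    return indices, speeds.index(speed)


# Hypothetical usage: a 300-frame video, seeded for repeatability.
indices, label = sample_speed_clip(num_frames=300, rng=random.Random(0))
```

A classifier trained to predict `label` from the frames at `indices` needs no human annotation, which is what makes this a pretext task.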

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
| ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
| OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
| CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
| PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
| Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
| PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |
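
The R@5 columns report recall at rank 5. A minimal sketch of how such a score can be computed is given below, assuming the common nearest-neighbour protocol in which each query clip counts as a hit when at least one of its `k` most similar gallery clips shares its class label. The function names, the use of cosine similarity, and the toy features are assumptions for illustration; individual papers may differ in feature extraction and distance metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=5):
    """Percentage of queries with a same-class clip among the top-k neighbours."""
    hits = 0
    for q, q_label in zip(query_feats, query_labels):
        # Rank gallery clips from most to least similar to the query.
        ranked = sorted(
            range(len(gallery_feats)),
            key=lambda i: cosine(q, gallery_feats[i]),
            reverse=True,
        )
        if any(gallery_labels[i] == q_label for i in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(query_feats)


# Toy example with 2-D features (hypothetical values).
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_labels = ["a", "b", "b"]
queries = [[1.0, 0.1], [0.1, 1.0]]
query_labels = ["a", "a"]
score = recall_at_k(queries, query_labels, gallery, gallery_labels, k=1)
```

In the toy example the first query retrieves a same-class clip at rank 1 and the second does not, so the score at `k=1` is half of the maximum.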

Generative Learning

Action Recognition

Downstream action recognition evaluation for models that use a generative self-supervised pre-training approach. Top scores are in bold.

| Model | Subcategory | Visual Backbone | Pre-train | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Mathieu et al. | Frame Prediction | C3D | Sports1M | 52.10 | -- |
| VideoGan | Reconstruction | VAE | Flickr | 52.90 | -- |
| Liang et al. | Frame Prediction | LSTM | UCF101 | 55.10 | -- |
| VideoMoCo | Frame Prediction | R(2+1)D | Kinetics | 78.70 | 49.20 |
| MemDPC-Dual | Frame Prediction | R(2+3)D | Kinetics | 86.10 | 54.50 |
| Tian et al. | Reconstruction | 3D R-101 | Kinetics | 88.10 | 59.00 |
| VideoMAE | MAE | ViT-L | ImageNet | 91.3 | 62.6 |
| MotionMAE | MAE | ViT-B | Kinetics | 96.3 | -- |

Video Retrieval

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| MemDPC-RGB | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
| MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |

Video Captioning

Contrastive Learning

Action Recognition

Downstream action recognition on UCF101 and HMDB51 for models that use contrastive learning and/or cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T) and audio (A). Spatio-temporal augmentations with contrastive learning typically are the highest performing approaches.

| Model | Subcategory | Visual | Modalities | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| VIE | Clustering | SlowFast | V | Kinetics | 78.90 | 50.1 |
| VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
| Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
| TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
| Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
| CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
| SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
| VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
| CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
| SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
| CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
| MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
| BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
| DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
| IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
| DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
| pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
| TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
| SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
| pSwAV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
| pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
| CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | |
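
Many of the contrastive methods above optimize some variant of an InfoNCE-style objective: the embedding of an augmented view (the positive) is pulled toward the anchor, while embeddings of other clips (the negatives) are pushed away. The sketch below shows that loss for a single anchor. It is a generic illustration, not the loss of any specific listed model; the function names and the temperature of 0.1 are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings: the positive matches the anchor, the negative does not.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the positive is the most similar candidate (`loss_aligned`) and large when a negative is more similar than the positive (`loss_misaligned`).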

Cross-Modal Learning

Text-to-Video Retrieval

Video Captioning

Action Segmentation

Downstream action segmentation evaluation on COIN for models that use a cross-modal agreement self-supervised pre-training approach. The top score is in bold.

| Model | Visual | Text | Pre-train | Frame-Acc |
|---|---|---|---|---|
| CBT | S3D-G | BERT | Kinetics+How2 | 53.90 |
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 56.95 |
| VideoClip (zs) | S3D-G | BERT | How2 | 58.90 |
| MIL-NCE | S3D | Word2Vec | How2 | 61.00 |
| VLM | S3D-G | BERT | How2 | 68.39 |
| VideoClip (ft) | S3D-G | BERT | How2 | 68.70 |
| UniVL | S3D-G | BERT | How2 | 70.20 |
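
The Frame-Acc column can be read as per-frame labelling accuracy. A minimal sketch follows, assuming the metric is the percentage of frames whose predicted step label equals the ground-truth label. The function name and the toy labels are hypothetical, and details such as how background frames are handled may vary between papers.

```python
def frame_accuracy(predicted, ground_truth):
    """Percentage of frames whose predicted label matches the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("prediction and ground truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one frame is required")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)


# Toy example: one of four frames is mislabelled.
acc = frame_accuracy(
    ["bg", "pour", "pour", "stir"],
    ["bg", "pour", "stir", "stir"],
)
```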

Temporal Action Step Localization

Downstream temporal action step localization evaluation on CrossTask for models that use a contrastive multimodal self-supervised pre-training approach. Top scores are in bold.

| Model | Visual | Text | Pre-train | Recall |
|---|---|---|---|---|
| VideoClip (zs) | S3D-G | BERT | How2 | 33.90 |
| MIL-NCE | S3D | Word2Vec | How2 | 40.50 |
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 41.40 |
| UniVL | S3D-G | BERT | How2 | 42.00 |
| VLM | S3D-G | BERT | How2 | 46.50 |
| VideoClip (ft) | S3D-G | BERT | How2 | 47.30 |

Evaluation Tasks

Action Recognition

Downstream action recognition on UCF101 and HMDB51 for models that use contrastive learning and/or cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T) and audio (A). Spatio-temporal augmentations with contrastive learning typically are the highest performing approaches.

| Model | Subcategory | Visual | Modalities | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| Geometry | Appearance | AlexNet | V | UCF101/HMDB51 | 54.10 | 22.60 |
| Wang et al. | Appearance | C3D | V | UCF101 | 61.20 | 33.40 |
| 3D RotNet | Appearance | 3D R-18 | V | MT | 62.90 | 33.70 |
| VideoJigsaw | Jigsaw | CaffeNet | V | Kinetics | 54.70 | 27.00 |
| 3D ST-puzzle | Jigsaw | C3D | V | Kinetics | 65.80 | 33.70 |
| CSJ | Jigsaw | R(2+3)D | V | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
| PRP | Speed | R3D | V | Kinetics | 72.10 | 35.00 |
| SpeedNet | Speed | S3D-G | V | Kinetics | 81.10 | 48.80 |
| Jenni et al. | Speed | R(2+1)D | V | UCF101 | 87.10 | 49.80 |
| PacePred | Speed | S3D-G | V | UCF101 | 87.10 | 52.60 |
| ShuffleLearn | Temporal Order | AlexNet | V | UCF101 | 50.90 | 19.80 |
| OPN | Temporal Order | VGG-M | V | UCF101 | 59.80 | 23.80 |
| O3N | Temporal Order | AlexNet | V | UCF101 | 60.30 | 32.50 |
| ClipOrder | Temporal Order | R3D | V | UCF101 | 72.40 | 30.90 |
| VIE | Clustering | SlowFast | V | Kinetics | 78.90 | 50.1 |
| VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
| Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
| TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
| Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
| CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
| SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
| VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
| CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
| SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
| CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
| MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
| BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
| MIL-NCE | Cross-Modal | S3D-G | V+T | How2 | 91.3 | 61.00 |
| GDT | Cross-Modal | R(2+1)D | V+T+A | Kinetics | 95.5 | 72.80 |
| CBT | Cross-Modal | S3D-G | V+T | Kinetics | 79.50 | 44.6 |
| VATT | Cross-Modal | Transformer | V+T | AS+How2 | 85.50 | 64.8 |
| AVTS | Cross-Modal | MC3 | V+A | Kinetics | 85.80 | 56.9 |
| AVID+Cross | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.00 | 64.1 |
| AVID+CMA | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.50 | 64.7 |
| MMV-FAC | Cross-Modal | TSM | V+T+A | AS+How2 | 91.80 | 67.1 |
| XDC | Cross-Modal | R(2+1)D | V+A | Kinetics | 95.50 | 68.9 |
| DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
| IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
| DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
| pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
| TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
| SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
| pSwAV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
| pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
| CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | |

Video Retrieval

Video Captioning

Text-to-Video Retrieval

| Dataset | Labels | Modalities | Classes | Videos | Tasks |
|---|---|---|---|---|---|
| ActivityNet (ActN) | Activity, Captions, Bounding Box | Video, Video+Text | 200 | 19,995 | Action Recognition, Video Captioning, Video Grounding |
| AVA | Activity, Face Tracks | Video, Video+Audio | 80 | 430 | Action Recognition, Audio-Visual Grounding |
| Breakfast | Activity | Video | 10 | 1,989 | Action Recognition, Action Segmentation |
| Charades | Activity, Objects, Indoor Scenes, Verbs | Video | 157 | 9,848 | Action Recognition, Object Recognition, Scene Recognition, Temporal Action Step Localization |
| COIN | Activity, Temporal Actions, ASR | Video, Video+Text | 180 | 11,827 | Action Recognition, Action Segmentation, Video Retrieval |
| CrossTask | Temporal Steps, Activity | Video | 83 | 4,700 | Temporal Action Step Localization, Recognition |
| HMDB51 | Activity | Video | 51 | 6,849 | Action Recognition, Video Retrieval |
| HowTo100M (How2) | ASR | Video+Text | - | 136M | Text-to-Video Retrieval, VideoQA |
| Kinetics | Activity | Video | 400/600/700 | 1/2 M | Action Recognition |
| MSRVTT | Activity, Captions | Video+Text | 20 | 10,000 | Action Recognition, Video Captioning, Video Retrieval, Visual Question Answering |
| MultiThumos | Activity, Temporal Steps | Video | 65 | 400 | Action Recognition, Temporal Action Step Localization |
| UCF101 | Activity | Video | 101 | 13,320 | Recognition, Video Retrieval |
| YouCook2 | Captions | Video+Text | 89 | 2,000 | Video Captioning, Video Retrieval |
| YouTube-8M | Activity | Video | 4,716 | 8M | Action Recognition |


