Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

This repository provides the implementation of the WACV 2022 paper: Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting.

Video Cross-Stream Prototypical Contrasting (ViCC)

We leverage both optical flow and RGB as views for contrastive learning, by predicting consistent stream prototype assignments from the views in the training of each model. This effectively transfers knowledge from motion (flow) to appearance (RGB).
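The idea of predicting consistent prototype assignments can be illustrated with a minimal pure-Python sketch. It assumes a SwAV-style formulation: prototype scores are turned into balanced soft assignments with a few Sinkhorn-Knopp iterations, and each view then predicts the assignment computed from the other view. The function names are hypothetical, the defaults mirror the --epsilon and --sinkhorn_iterations values used in the pretraining commands, and the softmax temperature of 0.1 is an assumption; the repository's actual implementation may differ.

```python
import math


def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn a (samples x prototypes) score matrix into soft assignments
    that spread the batch roughly equally over all prototypes.

    Assumes scores are cosine similarities in [-1, 1], so exp() cannot
    underflow to an all-zero column.
    """
    # Shift by the global max for numerical stability; the constant factor
    # is cancelled by the normalisations below.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    n_samples, n_protos = len(q), len(q[0])
    for _ in range(n_iters):
        # Each prototype (column) receives total mass 1 / n_protos.
        for k in range(n_protos):
            col = sum(q[i][k] for i in range(n_samples))
            for i in range(n_samples):
                q[i][k] /= col * n_protos
        # Each sample (row) carries total mass 1 / n_samples.
        for i in range(n_samples):
            row = sum(q[i])
            q[i] = [v / (row * n_samples) for v in q[i]]
    # Rescale so every row is a probability distribution over prototypes.
    return [[v * n_samples for v in row] for row in q]


def log_softmax(row, temperature):
    scaled = [s / temperature for s in row]
    top = max(scaled)
    log_z = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_z for s in scaled]


def swapped_prediction_loss(scores_a, scores_b, temperature=0.1):
    """Cross-entropy in which each view predicts the other view's assignment."""
    q_a, q_b = sinkhorn(scores_a), sinkhorn(scores_b)
    loss = 0.0
    for s_a, s_b, code_a, code_b in zip(scores_a, scores_b, q_a, q_b):
        log_p_a = log_softmax(s_a, temperature)
        log_p_b = log_softmax(s_b, temperature)
        loss -= sum(q * lp for q, lp in zip(code_b, log_p_a))  # A predicts B's code
        loss -= sum(q * lp for q, lp in zip(code_a, log_p_b))  # B predicts A's code
    return loss / (2 * len(scores_a))


# Toy example: 4 samples, 3 prototypes, one score matrix per view.
rgb_scores = [[0.9, 0.1, -0.3], [0.2, 0.8, 0.0], [-0.5, 0.3, 0.7], [0.6, -0.2, 0.4]]
flow_scores = [[0.7, 0.2, -0.1], [0.1, 0.9, 0.1], [-0.4, 0.2, 0.8], [0.5, 0.0, 0.3]]
print(swapped_prediction_loss(rgb_scores, flow_scores))
```

In this toy setting the two score matrices stand in for the RGB and flow views of the same clips; the loss is low when both views agree on which prototype each clip belongs to.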

Training process

The method consists of two stages. In the single-stream stage, the RGB and Flow encoders are each trained on their own features. In the cross-stream stage, both models are trained on both feature types; in each alternation, we optimize one model and its corresponding prototypes.

Results

Nearest-neighbour video retrieval results on UCF101:

Model	R@1
ViCC-RGB-2	62.1
ViCC-Flow-2	59.7
ViCC-R+F-2	65.1

Results on end-to-end finetuning for action recognition:

News

Pretrained models are now available (2021-08-24)

References

How to run the code

Get started

Requirements

Python 3.6
PyTorch 1.4.0, torchvision 0.5.0
CUDA 10.1
Apex with CUDA extension (see also: this issue)
See the environment file for the remaining packages: tqdm, pandas, python-lmdb 0.98, msgpack==1.0.0, msgpack-python==0.5.6.

Preprocessing

Follow the instructions in process_data.
Optional: see CoCLR for the dataset (last checked: 2021-07-03).

Pretrain and Evaluation

We provide several Slurm scripts for pretraining, as well as for linear probe, retrieval and finetuning experiments. Set your own paths in the scripts.
Distributed training is available via Slurm; the distributed initialization method must be set correctly (parameter dist_url).

How to run: pretraining

The algorithm consists of two stages (following CoCLR):

Single-stream: RGB model is trained on RGB data, then Flow on flow data.
Cross-stream: Both models are initialized with single-stream models. RGB is trained on both RGB and Flow data, then Flow is trained on RGB and Flow data. Repeat for N alternations.

Single-stream

Train ViCC-RGB-1 (Single-stream):

sbatch slurm_scripts/pretrain/single-rgb.sh

or:

cd src

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_single.py --net s3d --model vicc --dataset ucf101-2clip \
--seq_len 32 --num_seq 2 --ds 1 --batch_size 48 --wd 1e-6 --cos True \
--base_lr 0.6 --final_lr 0.0006 \
--epochs 500 --save_epoch 199 --optim sgd --img_dim 128 \
--dataset_root {DATASET_PATH} --prefix {EXPERIMENT_PATH} --name_prefix "single/rgb" \
--workers 12 --moco-dim 128 --moco-k 1920 --moco-t 0.1 \
--views_for_assign 0 1 --nmb_views 2 --epsilon 0.05 --sinkhorn_iterations 3 \
--nmb_prototypes 300 --epoch_queue_starts 200 --freeze_prototypes_nepochs 100 --use_fp16 False

Train ViCC-Flow-1 (Single-stream):

sbatch slurm_scripts/pretrain/single-flow.sh

or:

cd src

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_single.py --net s3d --model vicc --dataset ucf101-f-2clip \
--seq_len 32 --num_seq 2 --ds 1 --batch_size 48 --wd 1e-6 --cos True \
--base_lr 0.6 --final_lr 0.0006 \
--epochs 500 --save_epoch 199 --optim sgd --img_dim 128 \
--dataset_root {DATASET_PATH} --prefix {EXPERIMENT_PATH} --name_prefix "single/flow" \
--workers 12 --moco-dim 128 --moco-k 1920 --moco-t 0.1 \
--views_for_assign 0 1 --nmb_views 2 --epsilon 0.05 --sinkhorn_iterations 3 \
--nmb_prototypes 300 --epoch_queue_starts 200 --freeze_prototypes_nepochs 100 --use_fp16 False
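Both single-stream commands pass --cos True together with --base_lr 0.6 and --final_lr 0.0006. A minimal sketch of what such a cosine schedule typically looks like is given below; the exact formula, the per-epoch granularity and the absence of warm-up are assumptions, and cosine_lr is a hypothetical helper rather than a function from this repository.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.6, final_lr=0.0006):
    """Cosine decay from base_lr (at step 0) to final_lr (at total_steps)."""
    progress = step / total_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))


# Learning rate at a few points of a 500-epoch run (--epochs 500).
for epoch in (0, 125, 250, 375, 500):
    print(epoch, round(cosine_lr(epoch, 500), 4))
```

The schedule starts at the base learning rate, decays slowly at first, fastest around the midpoint, and flattens out again as it approaches the final learning rate.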

Cross-stream

Train ViCC-RGB-2 and ViCC-Flow-2:

sbatch slurm_scripts/pretrain/cross.sh

or:

cd src

Cycle 1 RGB:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_cross.py --net s3d --model 'vicc2' --dataset 'ucf101-2stream-2clip' \
--seq_len 32 --num_seq 2 --ds 1 --batch_size 24 --wd 1e-6 --cos True \
--base_lr 0.6 --final_lr 0.0006 --pretrain {ViCC-RGB-1-SINGLE.pth.tar} {ViCC-Flow-1-SINGLE.pth.tar} \
--epochs 100 --save_epoch 24 --optim sgd --img_dim 128 \
--dataset_root {DATASET_PATH} --prefix {EXPERIMENT_PATH} --name_prefix "cross/c1-flow-mining" \
--workers 12 --moco-dim 128 --moco-k 1920 --moco-t 0.1 \
--views_for_assign 0 1 2 3 --nmb_views 2 2 --epsilon 0.05 --sinkhorn_iterations 3 \
--nmb_prototypes 300 --epoch_queue_starts 25 --freeze_prototypes_nepochs 0 --use_fp16 True

Cycle 1 Flow (notice the reverse argument):

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_cross.py --net s3d --model 'vicc2' --dataset 'ucf101-2stream-2clip' \
--seq_len 32 --num_seq 2 --ds 1 --batch_size 24 --wd 1e-6 --cos True \
--base_lr 0.6 --final_lr 0.0006 --pretrain {ViCC-Flow-1-SINGLE.pth.tar} {ViCC-RGB-2-CYCLE-1.pth.tar}  \
--epochs 100 --save_epoch 24 --optim sgd --img_dim 128 \
--dataset_root {DATASET_PATH} --prefix {EXPERIMENT_PATH} --name_prefix "cross/c1-rgb-mining" \
--workers 12 --moco-dim 128 --moco-k 1920 --moco-t 0.1 \
--views_for_assign 0 1 2 3 --nmb_views 2 2 --epsilon 0.05 --sinkhorn_iterations 3 \
--nmb_prototypes 300 --epoch_queue_starts 25 --freeze_prototypes_nepochs 0 --use_fp16 True \
--reverse

Repeat the above two commands for the second cycle (Cycle 2 RGB, Cycle 2 Flow), passing the newest checkpoints to each run.
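The order of the cross-stream runs and the checkpoints each one loads can be sketched as follows. This is only a bookkeeping illustration under stated assumptions: cross_stream_schedule is a hypothetical helper, and the checkpoint names are placeholders that follow the {ViCC-...} pattern of the commands above, not file names produced by the code.

```python
def cross_stream_schedule(n_cycles=2):
    """List the cross-stream runs in order, with the checkpoints each one loads."""
    display = {"rgb": "RGB", "flow": "Flow"}
    # Start from the single-stream checkpoints (placeholder names).
    newest = {
        "rgb": "ViCC-RGB-1-SINGLE.pth.tar",
        "flow": "ViCC-Flow-1-SINGLE.pth.tar",
    }
    runs = []
    for cycle in range(1, n_cycles + 1):
        for stream, other in (("rgb", "flow"), ("flow", "rgb")):
            runs.append({
                "cycle": cycle,
                "train": stream,
                # Model being trained first, then the other stream's newest
                # checkpoint, matching the order of the --pretrain arguments.
                "pretrain": (newest[stream], newest[other]),
                # The Flow runs add the --reverse flag.
                "reverse": stream == "flow",
            })
            # The run just finished becomes the newest checkpoint of its stream.
            newest[stream] = f"ViCC-{display[stream]}-2-CYCLE-{cycle}.pth.tar"
    return runs


for run in cross_stream_schedule():
    print(run)
```

With two cycles this yields four runs: the Cycle 2 RGB run loads the Cycle 1 RGB checkpoint together with the Cycle 1 Flow checkpoint, and so on.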

How to run: evaluation

Use e.g. sbatch slurm_scripts/eval/retr-rgb-2.sh, sbatch slurm_scripts/eval/lin-rgb-2.sh or sbatch slurm_scripts/eval/ft-rgb-2.sh. The '2' in the script names refers to the models from the cross-stream stage, but single-stream models can be evaluated in the same way.

or:

cd src/eval

Nearest-neighbour video retrieval

For RGB:

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --retrieval \
--dirname {FEATURE_PATH} --test {TEST_PATH} --dataset_root {DATASET_PATH}

Use the --dataset 'ucf101-f' argument for flow.
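For reference, the R@1 metric reported in the results can be sketched as below. The sketch assumes the common protocol in which each test clip queries the training set by cosine similarity and counts as a hit if any of its k nearest neighbours shares its class; recall_at_k and the toy features are hypothetical and not taken from main_classifier.py.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(test_feats, test_labels, train_feats, train_labels, k=1):
    """Fraction of test clips with a same-class clip among their k nearest
    training clips, ranked by cosine similarity."""
    hits = 0
    for feat, label in zip(test_feats, test_labels):
        ranked = sorted(
            range(len(train_feats)),
            key=lambda i: cosine(feat, train_feats[i]),
            reverse=True,
        )
        if any(train_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(test_feats)


# Toy features: two training clips (classes 0 and 1) and three test clips.
train_feats, train_labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
test_feats, test_labels = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]], [0, 1, 1]
print(recall_at_k(test_feats, test_labels, train_feats, train_labels, k=1))
```

In the toy example the third test clip lies closest to the class-0 training clip although its label is 1, so it counts as a miss at k=1.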

Linear probe

For RGB, e.g.:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main_classifier.py --net 's3d' --dataset 'ucf101' \
--seq_len 32 --ds 1 --batch_size 32 --train_what last --optim sgd --lr 1e-1 --wd 1e-3 \
--epochs 100 --schedule 60 80 --name_prefix "lin-rgb-2" \
--prefix {EXPERIMENT_PATH} --pretrain {PRETRAIN_PATH} --dataset_root {DATASET_PATH}

Use the --dataset 'ucf101-f' argument for flow.

Test linear probe:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main_classifier.py --net s3d --dataset 'ucf101' \
--batch_size 32 --seq_len 32 --ds 1 --train_what last --ten_crop \
--prefix {EXPERIMENT_PATH} --test {TEST_PATH} --dataset_root {DATASET_PATH}

End-to-end finetuning

For RGB, e.g.:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main_classifier.py --net 's3d' --dataset 'ucf101' \
--seq_len 32 --ds 1 --batch_size 32 --train_what ft --optim sgd --lr 0.1 --wd 0.001 \
--epochs 500 --schedule 200 300 400 450 --name_prefix "ft-rgb-2" \
--prefix {EXPERIMENT_PATH} --pretrain {PRETRAIN_PATH} --dataset_root {DATASET_PATH}

Use the --dataset 'ucf101-f' argument for flow.

Test finetuning:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main_classifier.py --net s3d --dataset 'ucf101' \
--batch_size 32 --seq_len 32 --ds 1 --train_what ft --ten_crop \
--prefix {EXPERIMENT_PATH} --test {TEST_PATH} --dataset_root {DATASET_PATH}

Pretrained models

Single-stream:

Citation

If you find this repository helpful in your research, please consider citing our paper:

@article{toering2022selfsupervised,
    title={Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting}, 
    author={Martine Toering and Ioannis Gatopoulos and Maarten Stol and Vincent Tao Hu},
    journal={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    year={2022}
}

Acknowledgements

This work was supported and funded by the University of Amsterdam and BrainCreators B.V.

Author
Martine Toering, 2021
