[ECCV 2022] This repository includes the official implementation of our paper "In Defense of Image Pre-Training for Spatiotemporal Recognition".

In Defense of Image Pre-Training for Spatiotemporal Recognition

[NEW!] 2022/7/8 - Our paper has been accepted by ECCV 2022.

2022/5/5 - We have released the code and models.

Overview

This is a PyTorch/GPU implementation of the paper In Defense of Image Pre-Training for Spatiotemporal Recognition.

Figure: Overview of image pre-training and spatiotemporal fine-tuning.

Prerequisites

The code is built with the following libraries:

Video Dataset Preparation

We mainly focus on two widely used video classification benchmarks: Kinetics-400 and Something-Something V2.

Some notes before preparing the two datasets:

  1. We decode the videos online to reduce storage cost. In our experiments, the CPU only becomes a bottleneck when a clip has more than 8 input frames.

  2. The Kinetics-400 videos we use have a short side of 320 pixels. Our experiments use 240436 training and 19796 validation videos. We also provide the train/val lists.

We provide our annotation format and data structure below for easy setup.

  • Generate the annotation.

    The annotation usually consists of train.txt and val.txt. Each line of a *.txt file has the form:

    video_1 label_1
    video_2 label_2
    video_3 label_3
    ...
    video_N label_N
    

    The pre-processed dataset is organized with the following structure:

    datasets
      |_ Kinetics400
        |_ videos
        |  |_ video_0
        |  |_ video_1
        |  |_ ...
        |  |_ video_N
        |_ train.txt
        |_ val.txt
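
Before launching a long run, it can help to check that an annotation file and the videos folder agree. The following is a minimal sketch, not part of the repository: the function names `load_annotations` and `missing_videos` are hypothetical, and it assumes that each label is a non-negative integer class index and that every annotated name is an entry directly under `videos`.

```python
from pathlib import Path
from typing import List, Tuple


def load_annotations(list_file: Path) -> List[Tuple[str, int]]:
    """Parse lines of the form '<video> <label>' into (video, label) pairs."""
    samples = []
    for line_no, line in enumerate(list_file.read_text().splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the last run of whitespace so the label is always the final field.
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2 or not parts[1].isdigit():
            raise ValueError(
                f"{list_file}:{line_no}: expected '<video> <label>', got {line!r}"
            )
        samples.append((parts[0], int(parts[1])))  # assumption: integer labels
    return samples


def missing_videos(root: Path, samples: List[Tuple[str, int]]) -> List[str]:
    """Return annotated videos that are absent from <root>/videos."""
    return [video for video, _ in samples if not (root / "videos" / video).exists()]
```

For example, `missing_videos(Path("datasets/Kinetics400"), load_annotations(Path("datasets/Kinetics400/train.txt")))` would list the training entries that have no matching file, assuming the layout shown above.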
    

Model ZOO

We provide the video dataset lists and pre-trained weights on OneDrive and GoogleDrive.

ImageNet-1k

We provide ImageNet-1k pre-trained weights for five video models. All models are trained for 300 epochs. Please follow the scripts we provide to evaluate them or to fine-tune them on a video dataset.

Models/Configs    Resolution   Top-1   Checkpoints
ir-CSN50          224 * 224    78.8%   ckpt
R2plus1d34        224 * 224    79.6%   ckpt
SlowFast50-4x16   224 * 224    79.9%   ckpt
SlowFast50-8x8    224 * 224    79.1%   ckpt
Slowonly50        224 * 224    79.9%   ckpt
X3D-S             224 * 224    74.8%   ckpt

Kinetics-400

Here we provide the 50-epoch fine-tuning configs and checkpoints. We also include some 100-epoch checkpoints for better performance at a comparable computation cost.

Models/Configs    Resolution   Frames * Crops * Clips   50-epoch Top-1   100-epoch Top-1        Checkpoints folder
ir-CSN50          256 * 256    32 * 3 * 10              76.8%            76.7%                  ckpt
R2plus1d34        256 * 256    8 * 3 * 10               76.2%            Over training budget   ckpt
SlowFast50-4x16   256 * 256    32 * 3 * 10              76.2%            76.9%                  ckpt
SlowFast50-8x8    256 * 256    32 * 3 * 10              77.2%            77.9%                  ckpt
Slowonly50        256 * 256    8 * 3 * 10               75.7%            Over training budget   ckpt
X3D-S             192 * 192    13 * 3 * 10              72.5%            73.9%                  ckpt
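
The "Frames * Crops * Clips" column determines how much of each video a model sees at test time. The sketch below works through that arithmetic under the conventional reading, which is an assumption here: frames sampled per clip, spatial crops per clip, and temporal clips per video. `EvalProtocol` is a hypothetical helper, not a class from this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProtocol:
    frames: int  # frames sampled per clip
    crops: int   # spatial crops per clip
    clips: int   # temporal clips per video

    @classmethod
    def parse(cls, text: str) -> "EvalProtocol":
        """Parse a table entry such as '32 * 3 * 10'."""
        frames, crops, clips = (int(part) for part in text.split("*"))
        return cls(frames, crops, clips)

    @property
    def views(self) -> int:
        """Number of forward passes whose scores are averaged per video."""
        return self.crops * self.clips

    @property
    def frames_per_video(self) -> int:
        """Total frames processed for one video across all views."""
        return self.frames * self.views


protocol = EvalProtocol.parse("32 * 3 * 10")
print(protocol.views, protocol.frames_per_video)  # 30 960
```

Under this reading, the 32-frame models process four times as many frames per video as the 8-frame models at the same crop and clip counts.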

Something-Something V2

Models/Configs    Resolution   Frames * Crops * Clips   Top-1   Checkpoints
ir-CSN50          256 * 256    8 * 3 * 1                61.4%   ckpt
R2plus1d34        256 * 256    8 * 3 * 1                63.0%   ckpt
SlowFast50-4x16   256 * 256    32 * 3 * 1               57.2%   ckpt
Slowonly50        256 * 256    8 * 3 * 1                62.7%   ckpt
X3D-S             256 * 256    8 * 3 * 1                58.3%   ckpt

After downloading the checkpoints and putting them into the target path, you can fine-tune or test the models with the corresponding configs by following the instructions below.

Usage

Build

After installing the above dependencies, run:

git clone https://github.com/UCSC-VLAA/Image-Pretraining-for-Video
cd Image-Pretraining-for-Video
cd Image_Pre_Training            # first pretrain the 3D model on ImageNet
cd ../Spatiotemporal_Finetuning  # then finetune the model on target video dataset

Pre-Training

We have provided some widely-used 3D model pre-trained weights that you can directly use for evaluation or fine-tuning.

After downloading the pre-trained weights, you can, for example, evaluate the CSN model on ImageNet by running:

bash scripts/csn/distributed_eval.sh [number of gpus]

The pre-training scripts for the listed models are located in scripts. Before training a model on ImageNet, specify your data path and the path where checkpoints should be stored (--output). By default, we use wandb to plot the training curves.

For example, to pre-train a CSN model on ImageNet:

bash scripts/csn/distributed_train.sh [number of gpus]

Fine-tuning

After pre-training, you can fine-tune a video model on the target video dataset.

Some Notes:

  • In the config file, set load_from = [your pre-trained model path].

  • To disable the STS Conv, set reshape_t or reshape_st in the model config to False.
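
The two notes above might look as follows inside a config file. This is a hypothetical excerpt rather than a complete config: only `load_from`, `reshape_t` and `reshape_st` come from the notes, the checkpoint path is a placeholder, and the placement of the flag under `backbone` is an assumption that should be checked against the actual config you are editing.

```python
# Hypothetical config excerpt; merge these values into an existing config
# rather than using this file on its own.

# Path to the image pre-trained weights (placeholder path).
load_from = "checkpoints/SOME_CHECKPOINT.pth"

model = dict(
    backbone=dict(  # assumption: the flag lives in the backbone settings
        # Disable the STS Conv. Use reshape_t instead if that is the key
        # your config already sets.
        reshape_st=False,
    ),
)
```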

Then you can use the following command to fine-tune the models.

bash tools/dist_train.sh ${CONFIG_FILE} [number of gpus] [optional arguments]

Example: train a CSN model on Kinetics-400 dataset with periodic validation.

bash tools/dist_train.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py [number of gpus] --validate 

Testing

You can use the following command to test a model.

bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} [number of gpus] [optional arguments]

Example: test a CSN model on Kinetics-400 dataset and dump the result to a json file.

bash tools/dist_test.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py \
    checkpoints/SOME_CHECKPOINT.pth [number of gpus] --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob 
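
Once the results are dumped, the reported top-1 accuracy can be cross-checked offline. The sketch below rests on assumptions that are not confirmed by this README: that result.json holds a list with one vector of class scores per video, in the same order as the annotation file used for testing, and that labels are integer class indices. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Sequence


def read_labels(list_file: Path) -> List[int]:
    """Read the label (last field) from each '<video> <label>' line."""
    return [
        int(line.rsplit(maxsplit=1)[1])
        for line in list_file.read_text().splitlines()
        if line.strip()
    ]


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of videos whose highest-scoring class equals the label."""
    if not labels or len(scores) != len(labels):
        raise ValueError(f"got {len(scores)} score rows for {len(labels)} labels")
    correct = sum(
        max(range(len(row)), key=row.__getitem__) == label
        for row, label in zip(scores, labels)
    )
    return correct / len(labels)


def evaluate(result_file: Path, list_file: Path) -> float:
    """Assumed format: result_file is a JSON list of per-video score lists."""
    scores = json.loads(result_file.read_text())
    return top1_accuracy(scores, read_labels(list_file))
```

A call such as `evaluate(Path("result.json"), Path("datasets/Kinetics400/val.txt"))` should then agree with the top_k_accuracy value printed by the test script if the assumed format holds.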

Acknowledgment

This repo is based on timm and mmaction2. Thanks to the contributors of these repos!

Citation

@inproceedings{li2022videopretraining,
  title     = {In Defense of Image Pre-Training for Spatiotemporal Recognition}, 
  author    = {Xianhang Li and Huiyu Wang and Chen Wei and Jieru Mei and Alan Yuille and Yuyin Zhou and Cihang Xie},
  booktitle = {ECCV},
  year      = {2022},
}