Hierarchical Hourglass Convolutional Network for Efficient Video Classification (MM 2022)

This is the official implementation of the paper "Hierarchical Hourglass Convolutional Network for Efficient Video Classification", which has been accepted by MM 2022. Paper link

Updates

Oct 11, 2022

Released the V1 version (the version used in the paper) to the public. The complete code and models will be released soon.

Prerequisites

The code is built with the following libraries:

PyTorch >= 1.7, torchvision
tensorboardX

For video data pre-processing, you may need ffmpeg.

Data Preparation

We need to first extract videos into frames for all datasets (Kinetics-400, Something-Something V1 and V2, Diving48 and EGTEA Gaze+), following the TSN repo.
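
As a rough sketch of the frame-extraction step, the snippet below builds an ffmpeg command for one video and writes its frames into a per-video folder. The helper names `build_extract_cmd` and `extract_frames`, the `img_%05d.jpg` file pattern, and the JPEG quality value are assumptions made for illustration; check the TSN repo for the exact folder layout its data loaders expect.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(video_path, out_dir, pattern="img_%05d.jpg"):
    """Return an ffmpeg argument list that dumps every frame of one video.

    The file pattern and quality value are illustrative assumptions,
    not settings confirmed by the H2CN or TSN code.
    """
    out_dir = Path(out_dir)
    return [
        "ffmpeg",
        "-i", str(video_path),      # input video
        "-q:v", "2",                # high JPEG quality (assumed value)
        str(out_dir / pattern),     # numbered output frames
    ]


def extract_frames(video_path, out_root):
    """Extract frames into <out_root>/<video name>/ (requires ffmpeg on PATH)."""
    out_dir = Path(out_root) / Path(video_path).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_cmd(video_path, out_dir), check=True)
    return out_dir
```

Calling `extract_frames("videos/abc.mp4", "frames")` would, under these assumptions, write `frames/abc/img_00001.jpg`, `frames/abc/img_00002.jpg`, and so on.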

Code

The implementation of H2CN builds on the TSN, TSM, and TDN codebases.

Pretrained Models

Here we provide some of the pretrained models.

Something-Something

The Something-Something V1 and V2 datasets depend heavily on temporal information. We report performance at 224×224 resolution.

Something-Something-V1

Model	Frame * view	Top-1 Acc.	Top-5 Acc.
H2CN	8 * 1	53.6%	81.4%
H2CN	16 * 1	55.0%	82.4%
H2CN	(8+16) * 1	56.7%	83.2%

Something-Something-V2

Model	Frame * view	Top-1 Acc.	Top-5 Acc.
H2CN	8 * 1	65.2%	89.7%
H2CN	16 * 1	66.4%	90.1%
H2CN	(8+16) * 1	67.9%	91.2%

Kinetics-400

Model	Frame * view	Top-1 Acc.	Top-5 Acc.	Checkpoint
H2CN	8 * 30	76.9%	93.0%	link(password:defq)
H2CN	16 * 30	77.9%	93.3%	link(password:vert)
H2CN	(8+16) * 30	78.7%	93.6%

Diving48

Model	Frame * view	Top-1 Acc.	Checkpoint
H2CN	16 * 1	87.0%

EGTEA Gaze+

Model	Frame * view * clip	Split1	Split2	Split3
H2CN	8 * 1 * 1	66.2%	63.9%	60.5%

Train

python train.py

Test

We use the test code and protocol of the TDN repo.

For center-crop, single-clip testing, the procedure has two steps:

Run the following testing script:

CUDA_VISIBLE_DEVICES=0 python3 test_models_center_crop.py something \
--archs='resnet50' --weights <your_checkpoint_path>  --test_segments=8  \
--test_crops=1 --batch_size=16  --gpus 0 --output_dir <your_pkl_path> -j 4 --clip_index=0

Run the following script to compute the results from the raw scores:

python3 pkl_to_results.py --num_clips 1 --test_crops 1 --output_dir <your_pkl_path>

For 3-crop, 10-clip testing, the procedure has two steps:

Run the following testing script 10 times (clip_index from 0 to 9):

CUDA_VISIBLE_DEVICES=0 python3 test_models_three_crops.py  kinetics \
--archs='resnet50' --weights <your_checkpoint_path>  --test_segments=8 \
--test_crops=3 --batch_size=16 --full_res --gpus 0 --output_dir <your_pkl_path>  \
-j 4 --clip_index <your_clip_index>
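
The ten runs can be generated with a small loop rather than typed by hand. The sketch below only assembles the command lines, reusing the flags shown above; the helper names `build_test_cmd` and `all_clip_cmds` are hypothetical, and the paths are placeholders to replace with your own.

```python
import shlex


def build_test_cmd(weights, output_dir, clip_index, segments=8):
    """Assemble the three-crop test command for a single clip index."""
    return [
        "python3", "test_models_three_crops.py", "kinetics",
        "--archs=resnet50",
        "--weights", weights,
        f"--test_segments={segments}",
        "--test_crops=3",
        "--batch_size=16",
        "--full_res",
        "--gpus", "0",
        "--output_dir", output_dir,
        "-j", "4",
        "--clip_index", str(clip_index),
    ]


def all_clip_cmds(weights, output_dir, num_clips=10):
    """One command per clip index, 0 through num_clips - 1."""
    return [build_test_cmd(weights, output_dir, i) for i in range(num_clips)]


if __name__ == "__main__":
    # Placeholder paths; replace with your own checkpoint and output folder.
    for cmd in all_clip_cmds("<your_checkpoint_path>", "<your_pkl_path>"):
        print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(cmd))
```

Printing the commands first makes it easy to inspect them before launching; each list can also be passed directly to `subprocess.run`.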

Run the following script to ensemble the raw scores of the 30 views:

python3 pkl_to_results.py --num_clips 10 --test_crops 3 --output_dir <your_pkl_path>
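
To illustrate what ensembling 30 views (10 clips × 3 crops) means, here is a minimal sketch that averages per-view class scores for each video and then measures top-k accuracy. It is not the code of `pkl_to_results.py`; the function names, the in-memory list format, and the choice of plain averaging are assumptions for illustration only.

```python
def average_views(view_scores):
    """Average per-class scores over views.

    view_scores: list of views, each a list of per-class scores for one video.
    """
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [
        sum(view[c] for view in view_scores) / num_views
        for c in range(num_classes)
    ]


def topk_correct(scores, label, k=1):
    """True if `label` is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def accuracy(all_view_scores, labels, k=1):
    """Top-k accuracy (as a fraction) after averaging the views of each video."""
    hits = sum(
        topk_correct(average_views(views), label, k)
        for views, label in zip(all_view_scores, labels)
    )
    return hits / len(labels)
```

With 10 clips and 3 crops, each video would contribute 30 score lists to `average_views`; `accuracy(..., k=1)` and `accuracy(..., k=5)` then correspond to the Top-1 and Top-5 columns in the tables.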

Citing

@article{H2CN2022,
  title={Hierarchical Hourglass Convolutional Network for Efficient Video Classification},
  author={Yi Tan and Yanbin Hao and Hao Zhang and Shuo Wang and Xiangnan He},
  journal={MM 2022},
}

Acknowledgement

Thanks to the following GitHub projects:

ty-97/H2CN