PyTorch-MFNet

Multi-Fiber Networks for Video Recognition

This repository contains the code and trained models of:

Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng. "Multi-Fiber Networks for Video Recognition" (PDF).

Implementation

We use MXNet (commit 92053bd) for image classification and PyTorch 0.4.0a0 (commit a83c240) for video classification.

Normalization

The inputs are first subtracted by the mean RGB value [ 124, 117, 104 ], and then multiplied by 0.0167.
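
For reference, here is a minimal sketch of this normalization in Python; the HxWx3 uint8 RGB layout and the function name are illustrative assumptions, not the repository's exact preprocessing code:

import numpy as np

def normalize(frame_rgb):
    # Assumed layout: HxWx3 uint8 array in RGB order.
    mean_rgb = np.array([124, 117, 104], dtype=np.float32)
    # Subtract the mean RGB value, then scale by 0.0167.
    return (frame_rgb.astype(np.float32) - mean_rgb) * 0.0167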

Usage

Train the model from scratch on Kinetics:

python train_kinetics.py

Fine-tune with a pre-trained model:

python train_ucf101.py

or

python train_hmdb51.py

Evaluate the trained model:

cd test
# by default, this evaluates the trained model on UCF-101 (split1)
python evaluate_video.py

Results

Image Recognition (ImageNet-1k)

Single Model, Single Crop Validation Accuracy:

Model                    Params  FLOPs  Top-1   Top-5   MXNet Model
ResNet-18 (reproduced)   11.7 M  1.8 G  71.4 %  90.2 %  GoogleDrive
ResNet-18 (MF embedded)   9.6 M  1.6 G  74.3 %  92.1 %  GoogleDrive
MF-Net (N=16)             5.8 M  861 M  74.6 %  92.0 %  GoogleDrive

Video Recognition (UCF-101, HMDB51, Kinetics)

Model        Params  Target Dataset  Top-1
MF-Net (3D)  8.0 M   Kinetics        72.8 %
MF-Net (3D)  8.0 M   UCF-101         96.0 %*
MF-Net (3D)  8.0 M   HMDB51          74.6 %*

* accuracy averaged over split1, split2, and split3.

Trained Models

Model        Target Dataset    PyTorch Model
MF-Net (2D)  ImageNet-1k       GoogleDrive
MF-Net (3D)  Kinetics          GoogleDrive
MF-Net (3D)  UCF-101 (split1)  GoogleDrive
MF-Net (3D)  HMDB51 (split1)   GoogleDrive

Other Resources

ImageNet-1k Training/Validation List:

ImageNet-1k category name mapping table:

Kinetics Dataset:

UCF-101 Dataset:

HMDB51 Dataset:

FAQ

Do I need to convert the raw videos to a specific format?

  • Our `dataiter` supports reading raw videos directly and can tolerate corrupted videos (a rough decoding sketch follows).
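
As a rough illustration only (this is not the repository's dataiter; it assumes OpenCV is installed), decoding frames directly from a raw video can look like:

import cv2

def read_frames(video_path, num_frames=16):
    # Decode up to num_frames RGB frames from a raw video file.
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame_bgr = cap.read()
        if not ok:  # end of stream or a corrupted frame: stop gracefully
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames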

How can I make the training faster?

  • Decoding frames from compressed videos consumes a lot of CPU resources, which is the main speed bottleneck. You can try converting the downloaded videos to another format or reducing the video quality, for example (see also the Python batch-conversion sketch after this list):
# convert to short_edge_length = 360
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(360*iw)/min(iw\,ih)):-1" -b:v 640k -an ${DST_VID}
# or, convert to short_edge_length = 256
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(256*iw)/min(iw\,ih)):-1" -b:v 512k -an ${DST_VID}
# or, convert to short_edge_length = 160
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(160*iw)/min(iw\,ih)):-1" -b:v 240k -an ${DST_VID}
  • Use a machine with a faster CPU.
  • The group convolution operator may not be well optimized, which can also limit speed.
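
If you prefer to drive the conversion from Python, here is a minimal batch-conversion sketch for the short_edge_length = 256 setting above; the directory names are assumptions, while the codec, filter, and bitrate mirror the ffmpeg example:

import subprocess
from pathlib import Path

SRC_DIR = Path("videos_raw")   # assumed input directory
DST_DIR = Path("videos_256")   # assumed output directory
DST_DIR.mkdir(exist_ok=True)

for src in sorted(SRC_DIR.glob("*.mp4")):
    dst = DST_DIR / src.name
    # Same scale filter as the short_edge_length = 256 example above.
    subprocess.run([
        "ffmpeg", "-y", "-i", str(src),
        "-c:v", "mpeg4",
        "-filter:v", "scale=min(iw\\,(256*iw)/min(iw\\,ih)):-1",
        "-b:v", "512k", "-an", str(dst),
    ], check=True)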

Citation

If you use our code/models in your work or find them helpful, please cite the paper:

@inproceedings{chen2018multifiber,
  title={Multi-Fiber Networks for Video Recognition},
  author={Chen, Yunpeng and Kalantidis, Yannis and Li, Jianshu and Yan, Shuicheng and Feng, Jiashi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2018}
}