
Global Reasoning module for visual recognition


GloRe

Implementation for: Graph-Based Global Reasoning Networks (CVPR19)

Software

  • Image recognition experiments are in MXNet @92053bd
  • Video and segmentation experiments are in PyTorch (0.5.0a0+783f2c6)
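
At a high level, the GloRe unit described in the paper projects coordinate-space features into a small set of graph nodes, reasons over the fully connected node graph, and projects the result back with a residual connection. A minimal NumPy sketch of this forward pass follows; the weights here are random stand-ins (the real module learns them via 1x1 convolutions), and `glore_unit` is an illustrative name, not this repo's API:

```python
# Hedged sketch of one GloRe unit forward pass:
# projection -> graph reasoning -> reverse projection -> residual.
import numpy as np

def glore_unit(X, N=4, seed=0):
    """X: (L, C) feature map flattened over L spatial positions.
    N: number of graph nodes. Returns a tensor of the same shape."""
    L, C = X.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((N, L))        # projection weights (learned in practice)
    A = rng.standard_normal((N, N)) * 0.1  # node adjacency (learned in practice)
    W = rng.standard_normal((C, C)) * 0.1  # state-update weights (learned in practice)

    V = B @ X                    # project coordinate space -> node space: (N, C)
    V = (np.eye(N) - A) @ V @ W  # graph reasoning over the N nodes
    Y = B.T @ V                  # reverse projection back to coordinate space: (L, C)
    return X + Y                 # residual connection

out = glore_unit(np.ones((16, 8)))
print(out.shape)  # -> (16, 8)
```

Because the unit is shape-preserving, it can be dropped between residual stages (e.g. Res3/Res4 in the tables below) without changing the rest of the network.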

Train & Evaluate

Train kinetics (single node):

./run_local.sh

Train kinetics (multiple nodes):

# please set up ./Host before running
./run_dist.sh

Evaluate the trained model on kinetics:

cd test
# check $ROOT/test/*.txt for the testing log
python test-single-clip.py

Note:

  • The code is adapted from MFNet (ECCV18).
  • ImageNet pretrained models (R50, R101) may be required; please put them under $ROOT/network/pretrained/.
  • For image classification and segmentation tasks, please refer to the code linked below.

Results

Image Recognition (ImageNet-1k)

Model       | Method   | Res3 | Res4 | Code & Model | Top-1
ResNet50    | Baseline |      |      | link         | 76.2 %
ResNet50    | w/ GloRe |      | +3   | link         | 78.4 %
ResNet50    | w/ GloRe | +2   | +3   | link         | 78.2 %
SE-ResNet50 | Baseline |      |      | link         | 77.2 %
SE-ResNet50 | w/ GloRe |      | +3   | link         | 78.7 %
Model              | Method   | Res3 | Res4 | Code & Model | Top-1
ResNet200          | w/ GloRe |      | +3   | link         | 79.4 %
ResNet200          | w/ GloRe | +2   | +3   | link         | 79.7 %
ResNeXt101 (32x4d) | w/ GloRe | +2   | +3   | link         | 79.8 %
DPN-98             | w/ GloRe | +2   | +3   | link         | 80.2 %
DPN-131            | w/ GloRe | +2   | +3   | link         | 80.3 %

* We use pre-activation[1] and strided convolution[2] for all networks for simplicity and consistency.

Video Recognition (Kinetics-400)

Model              | Input Frames | Stride | Res3 | Res4 | Code & Model | Clip Top-1
Res50 (3D) + Ours  | 8            | 8      | +2   | +3   | link         | 68.0 %
Res101 (3D) + Ours | 8            | 8      | +2   | +3   | link         | 69.2 %

* ImageNet-1k pretrained models: R50(link), R101(link).
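
The clip sampling used above (8 input frames at a temporal stride of 8) can be sketched in plain Python. The function name and the clamping behavior for short videos are illustrative assumptions, not the repo's actual dataiter API:

```python
# Hedged sketch of uniform clip sampling: num_frames frames, `stride`
# frames apart, with indices clamped to the last valid frame.
def sample_clip_indices(total_frames, num_frames=8, stride=8, start=0):
    """Return the frame indices for one clip of a video."""
    span = (num_frames - 1) * stride
    if total_frames > span:
        start = max(0, min(start, total_frames - 1 - span))
    else:
        start = 0  # video shorter than one clip: repeat the last frame
    return [min(start + i * stride, total_frames - 1) for i in range(num_frames)]

print(sample_clip_indices(300))  # -> [0, 8, 16, 24, 32, 40, 48, 56]
```

For videos shorter than the clip span, the sketch simply repeats the final frame, which is one common convention.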

Semantic Segmentation (Cityscapes)

Method             | Backbone  | Code & Model | IoU cla. | iIoU cla. | IoU cat. | iIoU cat.
FCN + 1 GloRe unit | ResNet50  | link         | 79.5 %   | 60.3 %    | 91.3 %   | 81.5 %
FCN + 1 GloRe unit | ResNet101 | link         | 80.9 %   | 62.2 %    | 91.5 %   | 82.1 %

* All networks are evaluated on Cityscapes test set by the testing server without using extra “coarse” training set.

Other Resources

ImageNet-1k Training/Validation List:

ImageNet-1k category name mapping table:

Kinetics Dataset:

Cityscapes Dataset:

FAQ

Where can I find the code for image classification and segmentation?

  • The code is packed together with the model in the same *.tar file.

Do I need to convert the raw videos to specific format?

  • The `dataiter` supports reading from raw videos.

How can I make the training faster?

  • Remove the HLS augmentation (this makes little difference in accuracy).
  • Convert the raw videos to a lower resolution to reduce the decoding cost (we use a short edge of <=288 pixels for all experiments).

For example:

# convert to short_edge_length <= 288
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(288*iw)/min(iw\,ih)):-1" -b:v 640k -an ${DST_VID}
# or, convert to short_edge_length <= 256
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(256*iw)/min(iw\,ih)):-1" -b:v 512k -an ${DST_VID}
# or, convert to short_edge_length <= 160
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(160*iw)/min(iw\,ih)):-1" -b:v 240k -an ${DST_VID}
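
To convert many videos, the commands above can be wrapped in a small Python helper. `build_ffmpeg_cmd` and `convert` are hypothetical names for this sketch; the bitrate table simply mirrors the three examples above:

```python
# Hedged sketch: build and run the ffmpeg resize commands shown above.
# `convert` only works if ffmpeg is installed and on PATH.
import subprocess

# short-edge target -> video bitrate, mirroring the examples above
BITRATE = {288: "640k", 256: "512k", 160: "240k"}

def build_ffmpeg_cmd(src, dst, short_edge=288):
    """Return the ffmpeg argument list for one video conversion."""
    # commas inside min() must be escaped in ffmpeg filter syntax
    scale = f"scale=min(iw\\,({short_edge}*iw)/min(iw\\,ih)):-1"
    return ["ffmpeg", "-y", "-i", src, "-c:v", "mpeg4",
            "-filter:v", scale, "-b:v", BITRATE[short_edge], "-an", dst]

def convert(src, dst, short_edge=288):
    subprocess.run(build_ffmpeg_cmd(src, dst, short_edge), check=True)

cmd = build_ffmpeg_cmd("in.mp4", "out.mp4", short_edge=256)
print(" ".join(cmd))
```

Since the argument list is passed to `subprocess.run` without a shell, no shell quoting of the filter string is needed.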

Reference

[1] He, Kaiming, et al. "Identity mappings in deep residual networks." ECCV 2016.
[2] https://github.com/facebook/fb.resnet.torch

Citation

@inproceedings{chen2019graph,
  title={Graph-based global reasoning networks},
  author={Chen, Yunpeng and Rohrbach, Marcus and Yan, Zhicheng and Yan, Shuicheng and Feng, Jiashi and Kalantidis, Yannis},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={433--442},
  year={2019}
}

License

The code and the models are MIT licensed, as found in the LICENSE file.