Introduction

This repository is the official implementation of Contextual Transformer Networks for Visual Recognition for the object detection and instance segmentation tasks.

CoT is a unified self-attention building block that acts as an alternative to standard convolutions in ConvNets. As a result, convolutions can be replaced with their CoT counterparts to strengthen vision backbones with contextualized self-attention.
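
For reference, the following is a minimal, heavily simplified PyTorch sketch of a CoT-style block, written for this README rather than taken from the released code: a k x k convolution over the keys produces a static context, two consecutive 1 x 1 convolutions on the concatenation of that context and the queries produce per-position local attention weights, and the unfolded values are aggregated into a dynamic context. The official implementation differs in several details (grouped convolutions, attention weights shared within channel groups, and an attention-based fusion of the two contexts), so treat this only as an illustration of the idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoTBlock(nn.Module):
    """Simplified CoT-style block (illustrative; not the released implementation)."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        pad = kernel_size // 2
        # static context: a k x k convolution over the keys
        # (the official code uses a grouped convolution here)
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # value embedding: a 1 x 1 convolution
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # attention: two consecutive 1 x 1 convolutions on [static context, query],
        # producing k*k local attention weights per channel and position
        # (dim is assumed to be divisible by the reduction factor)
        reduction = 4
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, dim // reduction, 1, bias=False),
            nn.BatchNorm2d(dim // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim * kernel_size * kernel_size, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        k1 = self.key_embed(x)                        # static context
        v = self.value_embed(x)                       # values
        attn = self.attn(torch.cat([k1, x], dim=1))   # attention from [static context, query]
        attn = attn.reshape(b, c, self.kernel_size ** 2, h, w).softmax(dim=2)
        # gather the k x k neighborhood of every value position and aggregate it
        v = F.unfold(v, self.kernel_size, padding=self.kernel_size // 2)
        v = v.reshape(b, c, self.kernel_size ** 2, h, w)
        k2 = (attn * v).sum(dim=2)                    # dynamic context
        return k1 + k2                                # simplified fusion of both contexts

Applied to a tensor of shape (B, C, H, W), the block returns a tensor of the same shape, which is what allows it to stand in for a 3x3 convolution inside a ResNet bottleneck.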

Usage

Requirements:

Clone the repository:

git clone https://github.com/JDAI-CV/CoTNet-ObjectDetection-InstanceSegmentation.git

Train

First, download the COCO dataset. Then copy the code from this repository into detectron2 and build detectron2. To train CoTNet-50 on a single node with 8 GPUs, run:

python3 tools/train_net.py --num-gpus 8 --config-file configs/ObjectDetection/Faster-RCNN/CoTNet-50/faster_rcnn_CoT_50_FPN_1x.yaml

The training configs for CoTNet (e.g., CoTNet-50) can be found in the configs folder.
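
The command above can also be reproduced with detectron2's standard training API. The sketch below is illustrative only: it assumes the repository's code has already been copied into detectron2 so that the CoT backbones are registered (and, if the CoT configs add custom keys, that the repository's own setup code is used to extend the default config); the batch size, dataset root, and output directory are placeholders.

import os
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

def main():
    # detectron2 resolves builtin datasets (e.g. COCO) relative to this variable;
    # it defaults to "./datasets" if unset
    os.environ.setdefault("DETECTRON2_DATASETS", "/path/to/datasets")  # placeholder

    cfg = get_cfg()
    cfg.merge_from_file(
        "configs/ObjectDetection/Faster-RCNN/CoTNet-50/faster_rcnn_CoT_50_FPN_1x.yaml"
    )
    cfg.SOLVER.IMS_PER_BATCH = 16       # total batch size across all GPUs (placeholder)
    cfg.OUTPUT_DIR = "./output_cot50"   # checkpoints and logs are written here (placeholder)
    cfg.freeze()

    trainer = DefaultTrainer(cfg)       # builds the model, optimizer and data loaders
    trainer.resume_or_load(resume=False)
    trainer.train()

if __name__ == "__main__":
    main()

For multi-GPU training, the provided tools/train_net.py with --num-gpus (as shown above) is the intended entry point.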

The pre-trained CoTNet models for Object Detection and Instance Segmentation can be downloaded here.
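
As an illustration of how a downloaded checkpoint could be used, here is a minimal inference sketch based on detectron2's DefaultPredictor; the checkpoint filename and test image are placeholders, not files shipped with this repository.

import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(
    "configs/ObjectDetection/Faster-RCNN/CoTNet-50/faster_rcnn_CoT_50_FPN_1x.yaml"
)
cfg.MODEL.WEIGHTS = "/path/to/faster_rcnn_CoT_50_FPN_1x.pth"  # placeholder checkpoint path
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5                   # detection score threshold

predictor = DefaultPredictor(cfg)      # loads the weights and builds the test-time model
image = cv2.imread("demo.jpg")         # detectron2 expects a BGR image (OpenCV default)
outputs = predictor(image)

# predicted boxes, classes and scores for the detected instances
instances = outputs["instances"].to("cpu")
print(instances.pred_boxes, instances.pred_classes, instances.scores)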

Results on Object Detection task

Faster-RCNN

Backbone AP AP50 AP75 APs APm APl config file
CoTNet-50 43.50 64.84 47.53 26.36 47.54 56.49 log/config
CoTNeXt-50 44.06 65.76 47.65 27.08 47.70 57.21 log/config
SE-CoTNetD-50 43.96 65.20 48.25 27.71 47.05 56.51 log/config
CoTNet-101 45.35 66.80 49.18 28.65 49.47 58.82 log/config
CoTNeXt-101 46.10 67.50 50.22 29.44 49.84 59.26 log/config
SE-CoTNetD-101 45.66 66.86 50.11 29.83 49.25 59.17 log/config

Cascade-RCNN

Backbone AP AP50 AP75 APs APm APl config file
CoTNet-50 46.11 64.68 49.75 28.71 49.76 60.28 log/config
CoTNeXt-50 46.79 65.54 50.53 29.74 50.49 61.04 log/config
SE-CoTNetD-50 46.77 64.91 50.46 28.90 50.28 60.92 log/config
CoTNet-101 48.19 67.00 52.17 30.00 52.32 62.87 log/config
CoTNeXt-101 49.02 67.67 53.03 31.44 52.95 63.17 log/config
SE-CoTNetD-101 49.02 67.78 53.15 31.26 52.76 63.29 log/config

Results on Instance Segmentation task

Mask-RCNN

Backbone AP(bb) AP50(bb) AP75(bb) AP(mk) AP50(mk) AP75(mk) config file
CoTNet-50 44.06 64.99 48.29 39.28 62.12 42.17 log/config
CoTNeXt-50 44.47 65.74 48.71 39.62 62.70 42.35 log/config
SE-CoTNetD-50 44.16 65.26 48.32 39.38 62.18 42.23 log/config
CoTNet-101 46.17 67.17 50.63 40.86 64.18 43.64 log/config
CoTNeXt-101 46.66 67.70 50.90 41.21 64.45 44.27 log/config
SE-CoTNetD-101 46.67 67.85 51.30 41.53 64.92 44.69 log/config

Cascade-Mask-RCNN

Backbone AP(bb) AP50(bb) AP75(bb) AP(mk) AP50(mk) AP75(mk) config file
CoTNet-50 46.94 65.36 50.69 40.25 62.37 43.38 log/config
CoTNeXt-50 47.63 65.93 51.64 40.76 63.32 44.01 log/config
SE-CoTNetD-50 47.44 65.93 51.27 40.73 63.22 44.09 log/config
CoTNet-101 48.97 67.42 53.10 41.98 64.81 45.39 log/config
CoTNeXt-101 49.35 67.88 53.53 42.20 65.00 45.69 log/config
SE-CoTNetD-101 49.24 67.45 53.36 42.38 64.79 45.89 log/config

Citing Contextual Transformer Networks

@article{cotnet,
  title={Contextual Transformer Networks for Visual Recognition},
  author={Li, Yehao and Yao, Ting and Pan, Yingwei and Mei, Tao},
  journal={arXiv preprint arXiv:2107.12292},
  year={2021}
}

Acknowledgements

Thanks to the contributions of timm and the awesome PyTorch team.