Fast and accurate single-stage object detection with end-to-end GPU optimization.
RetinaNet is a single-shot object detector with multiple backbones offering various performance/accuracy trade-offs.
It is optimized for end-to-end GPU processing using:
- The PyTorch deep learning framework
- NVIDIA Apex for mixed precision and distributed training
- NVIDIA DALI for optimized data pre-processing
- NVIDIA TensorRT for high-performance inference
This is a research project, not an official NVIDIA product.
For best performance, we encourage using the latest PyTorch NGC docker container:
nvidia-docker run --rm --ipc=host -it nvcr.io/nvidia/pytorch:19.05-py3
From the container, simply install retinanet using pip:
pip install --no-cache-dir git+https://github.com/nvidia/retinanet-examples
Or you can clone this repository, build and run your own image:
git clone https://github.com/nvidia/retinanet-examples
docker build -t retinanet:latest retinanet-examples/
nvidia-docker run --rm --ipc=host -it retinanet:latest
Training, inference, evaluation and model export can be done through the `retinanet` utility.
For more details refer to the INFERENCE and TRAINING documentation.
Train a detection model on COCO 2017 from pre-trained backbone:
retinanet train retinanet_rn50fpn.pth --backbone ResNet50FPN \
--images /coco/images/train2017/ --annotations /coco/annotations/instances_train2017.json \
--val-images /coco/images/val2017/ --val-annotations /coco/annotations/instances_val2017.json
Fine-tune a pre-trained model on your dataset. In the example below we use Pascal VOC with JSON annotations:
retinanet train model_mydataset.pth \
--fine-tune retinanet_rn50fpn.pth \
--classes 20 --iters 10000 --val-iters 1000 --lr 0.0005 \
--resize 512 --jitter 480 640 --images /voc/JPEGImages/ \
--annotations /voc/pascal_train2012.json --val-annotations /voc/pascal_val2012.json
Note: the shorter side of the input images will be resized to `resize` as long as the longer side doesn't get larger than `max-size`. During training, the images will be randomly resized to a new size within the `jitter` range.
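A minimal Python sketch of that resize behavior (illustrative only; the names mirror the CLI options, and the `max_size` default here is a placeholder, not necessarily the tool's default):

```python
import random

def training_size(width, height, jitter=(480, 640), max_size=1333):
    """Sketch of the documented resize behavior during training.

    The shorter side is scaled to a random target drawn from the jitter
    range, but the scale is capped so the longer side never exceeds
    max_size. Not the actual implementation.
    """
    target = random.randint(*jitter)           # random shorter-side target
    scale = target / min(width, height)        # scale for the shorter side
    if max(width, height) * scale > max_size:  # cap the longer side
        scale = max_size / max(width, height)
    return round(width * scale), round(height * scale)

# Example: a 1920x1080 image lands between roughly 853x480 and 1138x640
print(training_size(1920, 1080))
```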
Evaluate your detection model on COCO 2017:
retinanet infer retinanet_rn50fpn.pth --images /coco/images/val2017/ --annotations /coco/annotations/instances_val2017.json
Run inference on your dataset:
retinanet infer retinanet_rn50fpn.pth --images /dataset/val --output detections.json
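The detections file can then be post-processed in Python. The sketch below assumes the output follows the COCO results convention (a list of records with `image_id`, `category_id`, `bbox` and `score`); adjust the keys if your output differs:

```python
import json
from collections import defaultdict

# Group detections by image and report the top score per image.
with open('detections.json') as f:
    detections = json.load(f)

per_image = defaultdict(list)
for det in detections:
    per_image[det['image_id']].append(det)

for image_id, dets in per_image.items():
    best = max(dets, key=lambda d: d['score'])
    print(image_id, len(dets), 'detections, top score {:.2f}'.format(best['score']))
```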
For faster inference, export the detection model to an optimized FP16 TensorRT engine:
retinanet export model.pth engine.plan
Note: for older versions of TensorRT (prior to TensorRT 5.1 / the 19.03 containers) the ONNX opset version should be specified (using `--opset 8` for instance).
Evaluate the model with TensorRT backend on COCO 2017:
retinanet infer engine.plan --images /coco/images/val2017/ --annotations /coco/annotations/instances_val2017.json
For even faster inference, do INT8 calibration to create an optimized INT8 TensorRT engine:
retinanet export model.pth engine.plan --int8 --calibration-images /coco/images/val2017/
This will create an INT8CalibrationTable file that can be used to create INT8 TensorRT engines for the same model later on without needing to do calibration.
Or create an optimized INT8 TensorRT engine using a cached calibration table:
retinanet export model.pth engine.plan --int8 --calibration-table /path/to/INT8CalibrationTable
Training numbers for COCO 2017 (train/val) after a full training schedule with default parameters.
Inference numbers include bounding-box post-processing for batch size 1.
Backbone | Resize | mAP @[IoU=0.50:0.95] | Training Time [DGX1v] | Inference Latency FP16 [V100] | Inference Latency FP16 [T4] | Inference Latency INT8 [T4]
---|---|---|---|---|---|---
ResNet18FPN | 800 | 0.318 | 5 hrs | 12 ms/im | 17 ms/im | 12 ms/im |
ResNet34FPN | 800 | 0.343 | 6 hrs | 14 ms/im | 20 ms/im | 14 ms/im |
ResNet50FPN | 800 | 0.358 | 7 hrs | 16 ms/im | 26 ms/im | 16 ms/im |
ResNet101FPN | 800 | 0.376 | 10 hrs | 20 ms/im | 34 ms/im | 20 ms/im |
ResNet152FPN | 800 | 0.393 | 12 hrs | 25 ms/im | 42 ms/im | 24 ms/im |
RetinaNet supports annotations in the COCO JSON format. When converting the annotations from your own dataset into JSON, the following entries are required:
{
    "images": [{
        "id": int,
        "file_name": str
    }],
    "annotations": [{
        "id": int,
        "image_id": int,
        "category_id": int,
        "bbox": [x, y, w, h]
    }],
    "categories": [{
        "id": int
    }]
}
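As an illustration, this sketch writes a minimal annotations file containing just those required entries (the ids, file names and boxes are made-up placeholders):

```python
import json

# Minimal sketch: build an annotations file with only the required
# entries from the schema above. All values below are placeholders.
dataset = {
    "images": [
        {"id": 0, "file_name": "img_0001.jpg"},
    ],
    "annotations": [
        # bbox is [x, y, w, h] in pixels, top-left origin
        {"id": 0, "image_id": 0, "category_id": 1, "bbox": [100, 50, 80, 120]},
    ],
    "categories": [
        {"id": 1},
    ],
}

with open("my_annotations.json", "w") as f:
    json.dump(dataset, f)
```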
- Focal Loss for Dense Object Detection. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár. ICCV, 2017.
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. June 2017.
- Feature Pyramid Networks for Object Detection. Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie. CVPR, 2017.
- Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR, 2016.