This is a work-in-progress PyTorch implementation of EfficientDet.
It is based on:
- the official Tensorflow implementation by Mingxing Tan and the Google Brain team
- the paper "EfficientDet: Scalable and Efficient Object Detection" by Mingxing Tan, Ruoming Pang, and Quoc V. Le
I am aware there are other PyTorch implementations. Their approach didn't fit well with my aim to replicate the Tensorflow models closely enough to allow weight ports while still maintaining a PyTorch feel and a high degree of flexibility for future additions. So, this is built from scratch and leverages my previous EfficientNet work.
Initial D1 training results in -- close but not quite there. Definitely in reach and better than any other non-official EfficientDet impl I've seen.
Biggest missing element is proper per-epoch mAP validation for better checkpoint selection (than loss based). I was resisting doing full COCO eval because it's so slow, but may throw that in for now...
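To illustrate why mAP-based checkpoint selection can differ from loss-based selection, here is a minimal sketch. The `EpochRecord` type and the numbers in `history` are hypothetical and made up for illustration; they are not results from this repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpochRecord:
    """One validation record per epoch (hypothetical structure)."""
    epoch: int
    val_loss: float
    val_map: float  # COCO AP @ IoU=0.50:0.95


def best_by_loss(records: List[EpochRecord]) -> EpochRecord:
    """Loss-based selection: lowest validation loss wins."""
    return min(records, key=lambda r: r.val_loss)


def best_by_map(records: List[EpochRecord]) -> EpochRecord:
    """mAP-based selection: highest validation mAP wins."""
    return max(records, key=lambda r: r.val_map)


# Toy numbers only, chosen so that the two criteria disagree.
history = [
    EpochRecord(epoch=10, val_loss=0.52, val_map=0.10),
    EpochRecord(epoch=20, val_loss=0.47, val_map=0.20),
    EpochRecord(epoch=30, val_loss=0.48, val_map=0.25),
]

print(best_by_loss(history).epoch, best_by_map(history).epoch)
```

In this toy history the lowest loss and the highest mAP fall on different epochs, which is the case where per-epoch mAP validation would pick a different checkpoint.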
D1:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.382
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.577
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.407
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.190
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.437
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.552
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.314
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.489
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.520
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.286
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.591
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713
Previous D0 result:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.324
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.513
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.342
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.121
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.383
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.499
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.280
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.426
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.452
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.188
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.532
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.668
First decent MSCOCO training results (from scratch, w/ pretrained classification backbone weights as starting point). 32.4 mAP for D0. Working on improvements and D1 trials still running.
Taking a pause on training; some high-priority things came up. There are signs of life on the training branch: I was working on the basic augs before the priority switch, and the loss fn appeared to be doing something sane with distributed training working, but there is no proper eval yet and the init is not correct yet. I will get to it, with SOTA training config and good performance as the end goal (as with my EfficientNet work).
Cleanup post-processing. Less code and a five-fold throughput increase on the smaller models. D0 running > 130 img/s on a single 2080Ti, D1 > 130 img/s on dual 2080Ti up to D7 @ 8.5 img/s.
Replace generate_detections with PyTorch impl using torchvision batched_nms. Significant performance increase with minor (+/-.001 mAP) score differences. Quite a bit faster than original TF impl on a GPU now.
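For readers unfamiliar with the operation, torchvision's `batched_nms` applies non-maximum suppression independently per category. The following is a pure-Python reference sketch of that idea, not the code used in this repo and not the torchvision implementation; the function names and the sample boxes are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_aware_nms(boxes: Sequence[Box], scores: Sequence[float],
                    classes: Sequence[int], iou_threshold: float) -> List[int]:
    """Greedy NMS where boxes only suppress boxes of the same class.

    Returns kept indices in order of decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        suppressed = any(
            classes[i] == classes[j] and iou(boxes[i], boxes[j]) > iou_threshold
            for j in keep
        )
        if not suppressed:
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 10, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
classes = [0, 0, 1, 0]
kept = class_aware_nms(boxes, scores, classes, iou_threshold=0.5)
print(kept)
```

Box 1 overlaps box 0 heavily and shares its class, so it is dropped; box 2 has the same coordinates as box 0 but a different class, so it survives. The tensor version does the same thing in a vectorized way, which is where the speedup comes from.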
Initial code with working validation posted. Yes, it's a little slow, but I think faster than the official impl on a GPU if you leave AMP enabled. Post processing needs some love.
- Feature extraction from my EfficientNet implementations (https://github.com/rwightman/gen-efficientnet-pytorch or https://github.com/rwightman/pytorch-image-models)
- Low level blocks / helpers (SeparableConv, create_pool2d (same padding), etc)
- PyTorch implementation of BiFPN, BoxNet, ClassNet modules and related submodules
- Port Tensorflow checkpoints to PyTorch -- initial D1 checkpoint converted, state_dict loaded, on to validation....
- Basic MS COCO validation script
- Temporary (hacky) COCO dataset and transform
- Port reference TF anchor and object detection code
- Verify model output sanity
- Integrate MSCOCO eval metric calcs
- Some cleanup, testing
- Submit to test-dev server, all good
- Add torch hub support and pretrained URL based weight download
- Remove redundant bias layers that exist in the official impl and weights
- Add visualization support
- Performance improvements, numpy TF detection code -> optimized PyTorch
- Verify/fix Torchscript and ONNX export compatibility
- Try PyTorch 1.5 w/ NHWC (channels last) order which matches TF impl
- Basic Training (object detection) reimplementation (still need proper per epoch validation)
- Advanced Training w/ Rand/AutoAugment, etc
- Training (semantic segmentation) experiments
- Integration with Detectron2 / MMDetection codebases
- Addition and cleanup of EfficientNet based U-Net and DeepLab segmentation models that I've used in past projects
- Addition and cleanup of OpenImages dataset/training support from a past project
- Exploration of instance segmentation possibilities...
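One item in the list above mentions removing redundant bias layers. A likely reason a conv bias is redundant when it feeds directly into batch norm is that mean subtraction cancels any constant per-channel offset. The small sketch below checks this numerically for one channel; `batch_norm` here is a simplified stand-in (no learned scale/shift, batch statistics only), not the module used in this repo.

```python
import math
from typing import List


def batch_norm(values: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one channel with its own batch mean and variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


conv_out = [0.5, -1.2, 3.0, 0.7]  # one channel's activations, no conv bias
bias = 2.5                        # an arbitrary per-channel conv bias
with_bias = [v + bias for v in conv_out]

# Largest elementwise difference between the two normalized outputs.
max_diff = max(
    abs(a - b) for a, b in zip(batch_norm(conv_out), batch_norm(with_bias))
)
print(max_diff)
```

The difference is at floating-point noise level, so the bias contributes nothing that the norm layer's own shift parameter cannot already express.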
If you are an organization interested in sponsoring any of this work, or if prioritization of the possible future directions interests you, feel free to contact me (issue, LinkedIn, Twitter, hello at rwightman dot com). I will set up a GitHub sponsor if there is any interest.
Variant | Download | mAP (val2017) | mAP (test-dev2017) | mAP (Tensorflow official test-dev2017) |
---|---|---|---|---|
D0 | tf_efficientdet_d0.pth | 32.8 | TBD | 33.8 |
D1 | tf_efficientdet_d1.pth | 38.5 | TBD | 39.6 |
D2 | tf_efficientdet_d2.pth | 42.0 | 42.5 | 43 |
D3 | tf_efficientdet_d3.pth | 45.3 | TBD | 45.8 |
D4 | tf_efficientdet_d4.pth | 48.3 | TBD | 49.4 |
D5 | tf_efficientdet_d5.pth | 49.6 | TBD | 50.7 |
D6 | tf_efficientdet_d6.pth | 50.6 | TBD | 51.7 |
D7 | tf_efficientdet_d7.pth | 50.9 | 51.2 | 52.2 |
Tested in a Python 3.7 or 3.8 conda environment in Linux with:
- PyTorch 1.4
- PyTorch Image Models (timm) 0.1.20, pip install timm or local install from (https://github.com/rwightman/pytorch-image-models)
- Apex AMP master (as of 2020-04)
NOTE - There is a conflict/bug between Numpy 1.18+ and pycocotools. Force install numpy <= 1.17.5 or the COCO eval will fail. If it does fail, the validation script will still save the output JSON, which can be run through eval again later.
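A minimal sketch of both halves of that note: checking a numpy version string against the 1.17.5 ceiling, and re-running the eval on a saved results file with the pycocotools API. It assumes the saved JSON is in the standard COCO results format; the function names and both file path arguments are hypothetical placeholders, not names used by the validation script.

```python
import re
from typing import Tuple


def numpy_version_ok(version: str,
                     max_version: Tuple[int, int, int] = (1, 17, 5)) -> bool:
    """Return True if `version` (e.g. '1.17.5') is <= max_version."""
    nums = tuple(int(n) for n in re.findall(r"\d+", version)[:3])
    return nums <= max_version


def rerun_coco_eval(annotation_file: str, results_file: str):
    """Run bbox COCO eval on a previously saved detections JSON.

    Both paths are placeholders supplied by the caller.
    """
    # Imported lazily so the version check above works without pycocotools.
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(annotation_file)           # ground-truth annotations
    coco_dt = coco_gt.loadRes(results_file)   # saved detections
    evaluator = COCOeval(coco_gt, coco_dt, "bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()                     # prints the AP/AR summary lines
    return evaluator.stats                    # stats[0] is AP @ IoU=0.50:0.95
```

A typical check before running validation would be `numpy_version_ok(numpy.__version__)`.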
MSCOCO 2017 validation data:
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
MSCOCO 2017 test-dev data:
wget http://images.cocodataset.org/zips/test2017.zip
unzip -q test2017.zip
wget http://images.cocodataset.org/annotations/image_info_test2017.zip
unzip image_info_test2017.zip
Run validation (val2017 by default) with D2 model: python validation.py /location/of/mscoco/ --model tf_efficientdet_d2 --checkpoint tf_efficientdet_d2.pth
Run test-dev2017: python validation.py /location/of/mscoco/ --model tf_efficientdet_d2 --checkpoint tf_efficientdet_d2.pth --anno test-dev2017
TODO: Need an inference script
Example distributed training command for D0: ./distributed_train.sh 2 /mscoco --model tf_efficientdet_d0 -b 16 --amp --lr .05 --warmup-epochs 5 --sync-bn --opt fusedmomentum --fill-color mean --model-ema
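The command above sets a base LR of .05 with 5 warmup epochs, and the notes mention a cosine LR schedule. A rough sketch of what a linear-warmup plus cosine-decay curve looks like with those two values follows. The total epoch count, warmup start LR, and minimum LR are assumptions, and the training script's scheduler may lay the curve out differently.

```python
import math


def lr_at_epoch(epoch: float, base_lr: float = 0.05, warmup_epochs: int = 5,
                total_epochs: int = 300, warmup_start_lr: float = 1e-4,
                min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay to min_lr.

    total_epochs, warmup_start_lr and min_lr are assumed values.
    """
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for e in (0, 5, 150, 300):
    print(e, lr_at_epoch(e))
```

The LR peaks at the end of warmup and decays smoothly towards the minimum by the last epoch.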
NOTE:
- Training script currently defaults to a model that does NOT have redundant conv + BN bias layers like the official models, set correct flag when validating.
- I've only trained with img mean (--fill-color mean) as the background for crop/scale/aspect fill, the official repo uses black pixel (0) (--fill-color 0). Both likely work fine.
- The official training code uses EMA weight averaging by default. It's not clear there is a point in doing this with the cosine LR schedule; I find the non-EMA weights end up better than EMA in the last 10-20% of training epochs.
- The default h-params are very close to unstable (exploding loss), don't try using Nesterov momentum. Try to keep the batch size up, use sync-bn.
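For reference, EMA weight averaging keeps a second copy of the weights that trails the trained weights. A minimal sketch of the update rule on plain floats is below; the function name, the dict-of-floats representation, and the decay value are illustrative assumptions, not the `--model-ema` implementation.

```python
from typing import Dict


def ema_update(ema: Dict[str, float], weights: Dict[str, float],
               decay: float = 0.999) -> None:
    """In-place update: ema = decay * ema + (1 - decay) * weights."""
    for name, value in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value


# Toy example with a large step (decay=0.5) so the effect is visible.
ema_weights = {"w": 0.0}
ema_update(ema_weights, {"w": 1.0}, decay=0.5)
first = ema_weights["w"]
ema_update(ema_weights, {"w": 1.0}, decay=0.5)
second = ema_weights["w"]
print(first, second)
```

Because the averaged copy lags the raw weights, it smooths noise early on but can trail behind once the LR has already decayed to near zero, which is consistent with the observation about late epochs above.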
NOTE: I've only tried submitting D2 and D7 to dev server for sanity check so far
D2 (test-dev2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.425
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.618
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.457
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.222
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.578
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.341
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.536
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.569
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.334
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.627
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.749
D7 (test-dev2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.707
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.554
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.335
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.549
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.639
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.385
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.660
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.485
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.696
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.800
D0 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.328
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.509
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.346
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.380
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.520
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.283
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.432
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.457
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.185
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.539
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.679
D1 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.385
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.575
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.410
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.179
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.441
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.491
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.519
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.589
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.721
D2 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.420
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.612
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.448
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.232
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.476
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.582
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.338
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.531
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.562
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.340
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.627
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.741
D3 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.453
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.646
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.485
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.498
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.614
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.355
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.563
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.597
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.399
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.647
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.762
D4 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.483
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.680
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.521
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.325
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.535
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.371
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.591
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.626
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.453
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.676
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
D5 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.496
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.692
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.533
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.327
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.543
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.642
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.377
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.608
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.644
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.469
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.688
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.785
D6 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.506
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.701
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.545
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.336
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.552
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.650
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.616
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.653
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.477
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.697
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.791
D7 (val2017):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.509
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.705
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.548
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.353
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.559
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.381
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.617
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.654
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.489
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.696
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.791
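The summary blocks above all share the same line format, so they can be parsed into a dict for comparison against the table. A small sketch, with the function name and key layout chosen for illustration; the sample text is two lines copied from the last block above.

```python
import re
from typing import Dict, Tuple

# Matches one summary line, tolerating variable spacing inside the brackets.
_LINE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\)\s+@\[\s*IoU=([\d.:]+)\s*\|"
    r"\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]\s*=\s*(-?[\d.]+)"
)

MetricKey = Tuple[str, str, str, int]  # (AP/AR, IoU range, area, maxDets)


def parse_coco_summary(text: str) -> Dict[MetricKey, float]:
    """Extract every AP/AR value found in a summary printout."""
    metrics: Dict[MetricKey, float] = {}
    for match in _LINE.finditer(text):
        kind, iou, area, max_dets, value = match.groups()
        metrics[(kind, iou, area, int(max_dets))] = float(value)
    return metrics


sample = """
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.509
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.381
"""
metrics = parse_coco_summary(sample)
print(metrics[("AP", "0.50:0.95", "all", 100)])
```

The headline mAP reported in the table corresponds to the `("AP", "0.50:0.95", "all", 100)` entry.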