ultralytics/yolov3

INCREASING NMS SPEED

glenn-jocher opened this issue · 43 comments

Non-Maximum Suppression (NMS) of bounding boxes is a significant speed constraint during testing. I am opening this issue to determine options for speeding up this operation. I am going to compare the default NMS method 'MERGE' with two newly available PyTorch methods. If anyone has additional methods we could test, please post here.

def non_max_suppression(prediction, conf_thres=0.5, nms_thres=0.5):
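For readers new to the thread, the baseline that the methods compared below build on is greedy NMS. The following is a minimal pure-Python sketch of that idea, not the repo's `non_max_suppression()`; the names `iou_xyxy` and `greedy_nms` and the example boxes are made up for illustration.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thres=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap an already-kept box too much.
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


# Hypothetical example: boxes 0 and 1 overlap heavily, box 2 is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_thres=0.5)  # box 1 is suppressed by box 0
```

The inner loop over already-kept boxes is what makes this sequential, which is the part the faster variants in this thread try to avoid.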

The test code is below. Hardware is a 2080Ti.

python3 test.py --weights ultralytics68.pt --nms-thres 0.6 --img-size 512 --device 0

UPDATE: THESE ARE OLD RESULTS, SEE BOTTOM OF THREAD FOR IMPROVED RESULTS

| NMS method | Speed (mm:ss) | COCO mAP @0.5...0.95 | COCO mAP @0.5 |
| --- | --- | --- | --- |
| ultralytics 'OR' | 8:20 | 39.7 | 60.3 |
| ultralytics 'AND' | 7:38 | 39.6 | 60.1 |
| ultralytics 'SOFT' | 12:00 | 39.1 | 58.7 |
| ultralytics 'MERGE' | 11:25 | 40.2 | 60.4 |
| torchvision.ops.boxes.nms() | 5:08 | 39.7 | 60.3 |
| torchvision.ops.boxes.batched_nms() | 6:00 | 39.7 | 60.3 |

The result of the test is that torchvision.ops.boxes.nms() is the fastest but does not give the highest mAP. The Ultralytics MERGE method increases AP by +0.5, so I will keep it for testing (when calling test.py directly using --conf-thres 0.001), and use torchvision.ops.boxes.nms() for calculating mAP during training using --conf-thres 0.10 (to increase training speed).

yolov3/utils/utils.py

Lines 513 to 517 in 1e9ddc5

# Set NMS method https://github.com/ultralytics/yolov3/issues/679
# 'OR', 'AND', 'MERGE', 'VISION', 'VISION_BATCHED'
method = 'MERGE' if conf_thres <= 0.01 else 'VISION' # MERGE is highest mAP, VISION is fastest

I will look more into this during the weekend.

great works!

torchvision.ops implements operators that are specific for Computer Vision. Those operators currently do not support TorchScript. Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU)

AttributeError: module 'torchvision' has no attribute 'ops'

what should I do?

@omizonly what is your use case for TorchScript?

tensorflow= 1.3.1

@omizonly I don't understand, can you elaborate? This repo only runs PyTorch, and exports to ONNX for onward use in other formats, however we clearly can not support you with problems in those other formats. I suggest you raise an issue on the PyTorch or TF repos.

I'll close this issue for now as the original issue appears to have been resolved, and/or no activity has been seen for some time. Feel free to comment if this is not the case.

Quick update with latest code on one T4 GPU. Second line is current default.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp.cfg --img 608
| NMS method | Time (per image) | Time (mm:ss) | COCO mAP @0.5...0.95 | COCO mAP @0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched', multi_cls=False | 43ms | 3:36 | 40.2 | 60.4 |
| 'vision_batched', multi_cls=True | 48ms | 4:01 | 40.9 | 61.4 |
| 'merge', multi_cls=True | 172ms | 14:23 | 41.3 | 61.7 |

Is there a way to make the model print the JSON file if it detects an object regardless of classification?

Hi, I saw a Fast NMS proposed by YOLACT. How is it? https://arxiv.org/abs/1912.06218

@Zzh-tju yes that seems an interesting approach. They apply NMS as a matrix operation to remove the for loop, which they say runs much faster with a minimum mAP penalty.

Depending on the conf-thres used, NMS may or may not be a very expensive operation in this repo. For most practical applications with conf-thres around 0.1-0.9, NMS is not a speed concern, taking <10% of the total processing time for an image, but when calculating mAP near conf-thres = 0.0001 for example, NMS may take up 90% of the processing time.

If you can try to implement a fast NMS experiment here that would be very useful. The NMS function is here. In the meantime I will update this thread with the latest speeds on a T4 colab instance.

yolov3/utils/utils.py

Lines 504 to 512 in dce753e

def non_max_suppression(prediction, conf_thres=0.5, iou_thres=0.5, multi_cls=True, classes=None, agnostic=False):
    """
    Removes detections with lower object confidence score than 'conf_thres'
    Non-Maximum Suppression to further filter detections.
    Returns detections with shape:
        (x1, y1, x2, y2, object_conf, conf, class)
    """
    # NMS methods https://github.com/ultralytics/yolov3/issues/679 'or', 'and', 'merge', 'vision', 'vision_batch'

UPDATE: I've posted an issue on yolact repo for this dbolya/yolact#366 (comment)

Update: I discovered a majority of time in test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (compute mAP only with repo code) I get the following times for the 5k COCO2014 val images. Machine is a 12-vCPU V100 instance.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
| NMS method | Time (ms/img) | Time (mm:ss) | mAP @0.5:0.95 | mAP @0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched' (default) | 15.2 ms | 1:16 | 41.9 | 61.8 |
| 'merge' | 103 ms | 8:35 | 42.3 | 62.0 |
| 'fast_batched' | 14.6 ms | 1:13 | 41.5 | 61.5 |

@Zzh-tju FastNMS updates have been committed and pushed now after testing.

yolov3/utils/utils.py

Lines 564 to 571 in f915bf1

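elif method == 'fast_batch':  # FastNMS from https://github.com/dbolya/yolact
    boxes += c.view(-1, 1) * max_wh  # offset boxes by class so different classes never overlap
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix (diagonal and below zeroed)
    i = iou.max(dim=0)[0] < iou_thres  # keep boxes whose max iou with any earlier box is below the threshold
    output[image_i] = pred[i]
    continue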
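To see how this matrix form can differ from standard greedy NMS, here is a small pure-Python sketch. It is one reading of the snippet above with illustrative names (`iou_xyxy`, `fast_nms`, `greedy_nms`) and made-up boxes, not code from the repo. In Fast NMS a box is removed if it overlaps any higher-scoring box, even one that was itself removed, so it can suppress more boxes than greedy NMS does.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thres=0.5):
    """Standard NMS: only boxes that were kept can suppress later boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


def fast_nms(boxes, scores, iou_thres=0.5):
    """Fast NMS: every higher-scoring box can suppress, kept or not."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for rank, j in enumerate(order):
        # Column-wise max of the strictly upper-triangular IoU matrix.
        # Each column is independent, which is why it vectorizes well.
        max_iou = max((iou_xyxy(boxes[i], boxes[j]) for i in order[:rank]), default=0.0)
        if max_iou < iou_thres:
            keep.append(j)
    return keep


# Hypothetical chain: A overlaps B, B overlaps C, but A barely overlaps C.
boxes = [(0, 0, 10, 10), (3, 0, 13, 10), (6, 0, 16, 10)]
scores = [0.9, 0.8, 0.7]
kept_greedy = greedy_nms(boxes, scores)  # keeps A and C
kept_fast = fast_nms(boxes, scores)      # keeps only A: C is removed by the already-removed B
```

This behaviour is consistent with the slightly lower mAP reported for 'fast_batched' in the table above, although the sketch does not by itself explain the full difference.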

@Zzh-tju to clear up the timing a bit more, I added profiling code to test.py that specifically tracks inference and NMS times in e482392. This can be accessed with the --profile flag:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:

Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image

So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).

The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!

CORRECTION: My previous analysis was incorrect: it lacked the torch.cuda.synchronize() calls necessary when profiling CUDA operations. I've fixed this in 1430a1e. Corrected results, consistent across several runs:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image

Conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than default torchvision.ops.boxes.batched_nms().

Inference can be sped up with larger batch sizes, but NMS is run per image in all cases, so the only ways to affect its speed currently are the ones listed below. Note that the 1.6 ms profile time uses all default settings though (none of these speedups are applied).

  • Increase your conf_thres
  • Turn off multi_cls
  • Decrease iou_thres
    def non_max_suppression(prediction, conf_thres=0.1, iou_thres=0.6, multi_cls=True, classes=None, agnostic=False):

Running a few tests to document effects on speed. These are with a V100 from a docker container, which is slightly slower than running natively.

python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608

rect=False
cudnn.deterministic=True, cudnn.benchmark = False:
12.9/1.8/14.8 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False:
9.9/1.7/11.6 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = True:
9.5/1.7/11.1 ms inference/NMS/total per 608x608 image at batch-size 32

rect=True
cudnn.deterministic=True, cudnn.benchmark = False:
9.8/1.7/11.5 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False: (default)
6.8/1.7/8.6 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = True:
18.2/1.7/19.9 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False, bs64
7.0/1.7/8.8 ms inference/NMS/total per 608x608 image at batch-size 64
cudnn.deterministic=False, cudnn.benchmark = False, bs1
14.0/2.0/16.0 ms inference/NMS/total per 608x608 image at batch-size 1
cudnn.deterministic=False, cudnn.benchmark = False, no contiguous() in models.py L207
6.8/1.7/8.5 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False, no contiguous(), reshape in models.py L207
6.8/1.7/8.5 ms inference/NMS/total per 608x608 image at batch-size 32

Running default natively:
Speed: 6.7/1.6/8.2 ms inference/NMS/total per 608x608 image at batch-size 32
no contiguous():
Speed: 6.6/1.6/8.2 ms inference/NMS/total per 608x608 image at batch-size 32
no contiguous() bs1:
Speed: 12.8/1.8/14.6 ms inference/NMS/total per 608x608 image at batch-size 1
yes contiguous() bs1:
Speed: 12.7/1.8/14.5 ms inference/NMS/total per 608x608 image at batch-size 1
no contiguous() bs1 img-size 512
Speed: 12.5/1.8/14.3 ms inference/NMS/total per 512x512 image at batch-size 1
no contiguous() bs1 img-size 416
Speed: 12.8/1.8/14.6 ms inference/NMS/total per 416x416 image at batch-size 1
no contiguous() bs1 img-size 608 yolov3-tiny
Speed: 3.2/1.8/4.9 ms inference/NMS/total per 608x608 image at batch-size 1

V100:
Speed: 6.6/1.5/8.1 ms inference/NMS/total per 608x608 image at batch-size 32
Speed: 17.2/1.5/18.8 ms inference/NMS/total per 800x800 image at batch-size 1
Speed: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1
Speed: 11.6/1.5/13.1 ms inference/NMS/total per 512x512 image at batch-size 1
Speed: 11.6/1.5/13.1 ms inference/NMS/total per 416x416 image at batch-size 1
Speed: 11.6/1.5/13.1 ms inference/NMS/total per 320x320 image at batch-size 1

2080Ti:
Speed: 9.2/1.2/10.4 ms inference/NMS/total per 608x608 image at batch-size 32
Speed: 13.9/1.5/15.4 ms inference/NMS/total per 608x608 image at batch-size 1

CPU:
Speed: 753.0/2.9/756.0 ms inference/NMS/total per 608x608 image at batch-size 1

Does batch_size=32 mean testing 32 images simultaneously, including NMS?

@Zzh-tju batch-size 32 means for example a 32x3x608x608 tensor is passed to the model for inference. The inference outputs are passed to NMS, which operates sequentially over the images:
for img in range(32):

def non_max_suppression(prediction, conf_thres=0.1, iou_thres=0.6, multi_label=True, classes=None, agnostic=False):

Test-time augmentation study #931:

Default + 0 ops: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1
Default + 1 ops: 18.7/1.6/20.3 ms inference/NMS/total per 608x608 image at batch-size 1
Default + 2 ops: 26.4/1.8/28.2 ms inference/NMS/total per 608x608 image at batch-size 1

Updated V100 speeds with fused inference:
Speed: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1 NEW RECORD
Speed: 6.5/1.5/8.1 ms inference/NMS/total per 608x608 image at batch-size 32 NEW RECORD

Default + 0 ops: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1
Default + 2 ops: 26.1/1.9/28.1 ms inference/NMS/total per 608x608 image at batch-size 1

SOLOv2 Table 7: Matrix NMS:
https://arxiv.org/pdf/2003.10152.pdf

[Two screenshots attached]

UPDATE: Unable to reproduce using this code:

            elif method == 'matrix_batch':  # Matrix NMS from https://arxiv.org/abs/2003.10152
                iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
                m = iou.max(0)[0].view(-1, 1)  # max values
                decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
                scores *= decay
                i = torch.full((boxes.shape[0],), fill_value=1).bool()
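To make the decay term in the snippet above concrete, here is a loop-based pure-Python sketch of the same Gaussian decay. It is an illustration under stated assumptions (boxes rather than masks, `sigma=0.5` as in the snippet, made-up boxes, hypothetical names `iou_xyxy` and `matrix_nms_decay`) and is not offered as a fix for the reproduction problem.

```python
import math


def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matrix_nms_decay(boxes, scores, sigma=0.5):
    """Return (order, decay); decay[r] multiplies the score of the r-th ranked box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    # iou[i][j] is filled only for rank i < rank j (strictly upper triangular).
    iou = [[iou_xyxy(boxes[order[i]], boxes[order[j]]) if i < j else 0.0
            for j in range(n)] for i in range(n)]
    # comp[i]: largest IoU of box i with any higher-scoring box, i.e. how
    # likely box i is to have been suppressed itself.
    comp = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]
    decay = []
    for j in range(n):
        # Gaussian decay from each higher-scoring box i, compensated by comp[i].
        d = min((math.exp(-(iou[i][j] ** 2 - comp[i] ** 2) / sigma) for i in range(j)),
                default=1.0)
        decay.append(d)
    return order, decay


# Hypothetical chain: A overlaps B, B overlaps C, A barely overlaps C.
boxes = [(0, 0, 10, 10), (3, 0, 13, 10), (6, 0, 16, 10)]
scores = [0.9, 0.8, 0.7]
order, decay = matrix_nms_decay(boxes, scores)
new_scores = [scores[i] * d for i, d in zip(order, decay)]
```

In this example B is decayed strongly by A, while C is decayed only mildly: its overlap with B is compensated because B itself overlaps A. A final score threshold (not shown) would then decide which boxes survive.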

torchvision.ops implements operators that are specific for Computer Vision. Those operators currently do not support TorchScript. Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU)

AttributeError: module 'torchvision' has no attribute 'ops'

what should I do?

Have you solved it? I ran into the same problem.

@glenn-jocher Hi, could you tell me why we cannot do NMS across the batch? Currently, NMS is done on images one by one, even though batch testing is turned on.

The number of detections differs from image to image. Is that the reason why we cannot perform real batched NMS?

@Zzh-tju feel free to play around with the NMS code and try your idea out. If you see performance improvements please submit a PR! Thank you.

@glenn-jocher I have just figured out a speed improvement and will send you a PR later. You can try it and optimize it further.

Torchvision NMS cannot run in a cross-image mode (if we add an image-related offset to the boxes, the size of the IoU matrix grows quadratically), so I have to try Cluster-NMS. I keep the NMS preprocessing unchanged and just replace the core part of your merge NMS with Cluster-Weighted NMS.
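The quadratic growth mentioned here can be checked with a little arithmetic. The sketch below counts IoU matrix entries for per-image NMS versus a single matrix over the whole batch; the figure of 300 boxes per image is a made-up example value, and the function names are illustrative.

```python
def iou_entries_per_image(num_images, boxes_per_image):
    """Total IoU entries when each image gets its own n x n matrix."""
    return num_images * boxes_per_image ** 2


def iou_entries_whole_batch(num_images, boxes_per_image):
    """IoU entries when all boxes in the batch share one big matrix."""
    return (num_images * boxes_per_image) ** 2


per_image = iou_entries_per_image(32, 300)      # 32 matrices of 300 x 300
whole_batch = iou_entries_whole_batch(32, 300)  # one 9600 x 9600 matrix
ratio = whole_batch // per_image                # grows linearly with the batch size
```

With equal box counts per image, the single batch matrix is larger by a factor equal to the batch size, and almost all of the extra entries compare boxes from different images.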

| Metric | Batch size | torchvision merge NMS | batch mode Cluster-Weighted NMS | Cluster-Weighted NMS |
| --- | --- | --- | --- | --- |
| AP | - | 42.9 | 42.9 | 42.9 |
| time | 4 | 3.0ms | 4.4ms | 5.5ms |
| time | 32 | 2.3ms | 3.0ms | 4.7ms |

Now I want to ask: why does the NMS time decrease as the batch size increases (for torchvision NMS)?
What is the maximum batch size we can use? I run on two 2080Ti GPUs, and batch size 32 takes about 6~7 GB of memory per GPU.
I guess that if we continue to increase the batch size when testing, the batch mode Cluster-NMS series may benefit more.
However, given my limited coding ability, the code could probably be optimized further.

I think maybe the best way is to integrate the NMS preprocessing into batch mode as well, even if it brings a slight performance drop. Preprocessing now takes about 1.3~1.5 ms, and your torchvision merge NMS takes just 0.8 ms. There is still room for acceleration.

@Zzh-tju ah! Thanks for the interesting study. We've actually discovered that in yolov5 the regression is improved enough that we can stop using merge, and simply use the default pytorch NMS to get the same results. So the current NMS strategy in the yolov5 function no longer uses merge.

It is an interesting idea to do a batched NMS approach instead of calling the nms function once per image. Your results show a significant improvement, 2.3 / 3.0 is about 25% faster (!). This would make a huge improvement on yolov5s for example, which has inference time of 2.1ms per image at batch-size 32 FP16, about half of which is used up with NMS. See speeds here. NMS is about 1 ms per image in these numbers, so a 25% speedup there would be noticeable in the table.
https://github.com/ultralytics/yolov5#pretrained-checkpoints

Right now the boxes are offset by (class * max_image_size) so that NMS is batched per image (so different classes never overlap). I suppose to run once per batch we would offset boxes by (class * max_image_size * image_index)? Are you using torchvision.ops.nms() or torchvision.ops.batched_nms()?
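The offset idea can be sketched in pure Python as follows. This is an illustration, not the repo's code: `offset_boxes` is a hypothetical helper, `max_wh=4096` is an assumed upper bound on box coordinates, and the combined index `image_index * num_classes + class` is one possible choice. A plain product of class and image index would be zero for class 0 or image 0, so those groups would still share coordinates, which is why the sketch uses a combined index instead.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def offset_boxes(boxes, classes, image_ids, num_classes, max_wh=4096):
    """Shift each box by a per-(image, class) offset so groups never overlap."""
    shifted = []
    for (x1, y1, x2, y2), c, b in zip(boxes, classes, image_ids):
        shift = (b * num_classes + c) * max_wh  # unique for each (image, class) pair
        shifted.append((x1 + shift, y1 + shift, x2 + shift, y2 + shift))
    return shifted


# Hypothetical detections: identical boxes in images 0 and 1, plus a near-duplicate in image 1.
boxes = [(10, 10, 50, 50), (10, 10, 50, 50), (12, 12, 52, 52)]
classes = [0, 0, 0]
image_ids = [0, 1, 1]
shifted = offset_boxes(boxes, classes, image_ids, num_classes=80)

cross_image_iou = iou_xyxy(shifted[0], shifted[1])  # different images: no overlap after the shift
same_image_iou = iou_xyxy(shifted[1], shifted[2])   # same image and class: IoU is unchanged
```

After the shift, a single NMS call over all boxes only ever suppresses within one image and class, at the cost of the larger IoU matrix discussed earlier in the thread.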

@glenn-jocher no, you misunderstood me. My question is: why does NMS speed also increase as the batch size increases?

@Zzh-tju in my experiments with yolov5, NMS speed is the same no matter the batch size. For example from the notebook:

!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 1
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 8
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 32

Output:

Namespace(augment=False, batch_size=1, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 8725.21it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 128/128 [00:03<00:00, 37.63it/s]
                 all         128         929       0.379        0.74       0.676        0.44
Speed: 9.3/1.8/11.1 ms inference/NMS/total per 640x640 image at batch-size 1


Namespace(augment=False, batch_size=8, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 5722.17it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 16/16 [00:02<00:00,  5.41it/s]
                 all         128         929       0.381       0.744        0.68       0.442
Speed: 4.1/2.2/6.3 ms inference/NMS/total per 640x640 image at batch-size 8


Namespace(augment=False, batch_size=32, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 9776.04it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 4/4 [00:04<00:00,  1.12s/it]
                 all         128         929       0.385       0.752       0.692       0.452
Speed: 4.2/2.1/6.3 ms inference/NMS/total per 640x640 image at batch-size 32

So 1.8ms, 2.2ms, 2.1ms at batch sizes 1, 8, 32. Basically NMS speed per image is not correlated to batch size.

got it @glenn-jocher , I will do more test with batchsize.

@glenn-jocher Hi, I have just finished some minor work on Batch Mode Weighted Cluster-NMS for speeding up NMS. You can check https://github.com/Zzh-tju/yolov5 for details. My conclusion is that Batch Mode Weighted Cluster-NMS benefits us when TTA is used.

@Zzh-tju ah, very interesting! I'll check out the forked repo.

@Zzh-tju I looked things over. You've clearly done a lot of work and experimentation!

I see it's hard to provide substantial gains off of the basic NMS unfortunately. I think this is because box regression is improving over past works, so perhaps the gains presented by merging two 0.90 iou boxes are less than for example merging two 0.5 iou boxes. It's unfortunate, because actually one of the yolov5 changes is increased grid sensitivity. In yolov3, only one cell per output layer could trigger on an object. In yolov5, >=3 cells per output layer always trigger per object (the nearest 3), so I'd expect many more boxes being proposed by yolov5 than by yolov3. It's frustrating that there isn't a better way to exploit all these extra statistics.

One very interesting thing I discovered during the TTA and ensembling work is that merging output grids always produced better results than appending output boxes together. If you look at the YOLOv5 ensembling module you will see that there are 3 options:
https://github.com/ultralytics/yolov5/blob/cab36f72a852ef00e8b42d3283ba9b2fc757b17f/models/experimental.py#L117-L129

  • mean ensemble: performs mean() of all output grids. For example, the YOLOv5s small output grid and the YOLOv5m small output grid have the same shape, so this takes the mean() of the two grids. Best results.
  • max ensemble: same as mean(), but applies max(). Poor results.
  • nms ensemble: appends all output boxes together for NMS to sort out. Ok results.

If there were a way to mean() TTA output grids the way that the mean ensemble works, this might produce the best results, but it is very complicated due to the varying output shapes, so I abandoned this effort.
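The three ensembling options can be sketched on plain nested lists. This is a simplified illustration, not the YOLOv5 module: each model output is represented as a list of rows of floats, the helper names are hypothetical, and the numbers are made up. It also shows why mean and max need equal shapes while the NMS option does not.

```python
def _check_same_shape(outputs):
    """Elementwise combination only makes sense for equally-shaped outputs."""
    shapes = {(len(o), len(o[0])) for o in outputs}
    if len(shapes) != 1:
        raise ValueError(f"cannot combine outputs of different shapes: {sorted(shapes)}")


def mean_ensemble(outputs):
    """Elementwise mean across models."""
    _check_same_shape(outputs)
    return [[sum(vals) / len(vals) for vals in zip(*rows)] for rows in zip(*outputs)]


def max_ensemble(outputs):
    """Elementwise max across models."""
    _check_same_shape(outputs)
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*outputs)]


def nms_ensemble(outputs):
    """Append all rows together and leave de-duplication to a later NMS step."""
    return [row for out in outputs for row in out]


# Two hypothetical model outputs with the same shape (2 rows x 2 values).
model_a = [[0.2, 0.8], [0.6, 0.4]]
model_b = [[0.4, 0.6], [0.2, 0.8]]

mean_out = mean_ensemble([model_a, model_b])
max_out = max_ensemble([model_a, model_b])
nms_out = nms_ensemble([model_a, model_b])

# An output of a different size (as TTA at another image size would give) cannot be averaged.
model_c = [[0.5, 0.5]]
try:
    mean_ensemble([model_a, model_c])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Only the concatenating option accepts outputs of different sizes, which matches the difficulty described above for TTA.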

@glenn-jocher wait a second, why do TTA output grids have different shape of outputs?

@glenn-jocher And I did see an improvement when merging two 0.8 IoU boxes rather than two 0.65 boxes.

@Zzh-tju ensemble output grids will have the same shape, for example if you run both YOLOv5s and YOLOv5m at the same image size, the 3 output grids from YOLOv5s are the same size as from YOLOv5m.

TTA uses different inference sizes as part of its augmentation, so naturally the output grids will change in size, and can no longer be directly averaged.

Hmm, interesting, 0.8 IoU is higher than I've ever tried. I think the more accurate the box regressions, the higher you can raise the IoU threshold. What was the improvement you saw using 0.8 IoU?

@glenn-jocher see the results in https://github.com/Zzh-tju/yolov5. weighted threshold is the merging threshold

@glenn-jocher
[Image attached]
Do you mean that when the input size changes, the size of the output grid maps changes too?

@Zzh-tju yes. YOLOv5 strides are 8, 16, 32 on the small, medium and large object output layers. So a 640x640 image will have 3 output grids of size 80x80, 40x40, 20x20.

The same output grids for a 320x320 image are 40x40, 20x20, 10x10.
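The grid sizes follow directly from dividing the image size by each stride. A minimal sketch, with `grid_sizes` as an illustrative helper name and results listed in stride order (8, 16, 32):

```python
def grid_sizes(img_size, strides=(8, 16, 32)):
    """Side length of the square output grid for each stride, in stride order."""
    return [img_size // s for s in strides]


grids_640 = grid_sizes(640)  # [80, 40, 20]
grids_320 = grid_sizes(320)  # [40, 20, 10]

# Grids from two inference sizes only line up if every side length matches.
same_shapes = grids_640 == grids_320
```

Since no grid from the 640 pass has the same shape as its counterpart from the 320 pass, an elementwise mean across TTA scales is not directly possible.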