INCREASING NMS SPEED
glenn-jocher opened this issue · 43 comments
Non-Maximum Suppression (NMS) of bounding boxes is a significant speed constraint during testing. I am opening this issue to try to determine options for speeding up this operation. I am going to compare the default NMS method 'MERGE' with two newly available PyTorch methods. If anyone has any additional methods we could test, please post here.
Line 456 in cadd2f7
The test code is below. Hardware is a 2080Ti.
python3 test.py --weights ultralytics68.pt --nms-thres 0.6 --img-size 512 --device 0
UPDATE: THESE ARE OLD RESULTS, SEE BOTTOM OF THREAD FOR IMPROVED RESULTS
| NMS method | Speed (mm:ss) | COCO mAP @0.5:0.95 | COCO mAP @0.5 |
|---|---|---|---|
| ultralytics 'OR' | 8:20 | 39.7 | 60.3 |
| ultralytics 'AND' | 7:38 | 39.6 | 60.1 |
| ultralytics 'SOFT' | 12:00 | 39.1 | 58.7 |
| ultralytics 'MERGE' | 11:25 | 40.2 | 60.4 |
| torchvision.ops.boxes.nms() | 5:08 | 39.7 | 60.3 |
| torchvision.ops.boxes.batched_nms() | 6:00 | 39.7 | 60.3 |
The result of the test is that torchvision.ops.boxes.nms() is fastest but does not give the highest mAP. The ultralytics 'MERGE' method increases AP by +0.5, so I will keep it for testing (when calling test.py directly using --conf-thres 0.001), and use torchvision.ops.boxes.nms() for calculating mAP when training using --conf-thres 0.10 (to increase training speed).
Lines 513 to 517 in 1e9ddc5
I will look more into this during the weekend.
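For reference, a minimal sketch of how the two torchvision calls are used (illustrative tensors, not the repo's exact NMS pipeline):

```python
import torch
import torchvision

# Illustrative inputs only
boxes = torch.rand(100, 4) * 512            # xyxy candidate boxes
boxes[:, 2:] += boxes[:, :2]                # ensure x2 >= x1 and y2 >= y1
scores = torch.rand(100)                    # confidence per box
classes = torch.randint(0, 80, (100,))      # predicted class per box

# Class-agnostic NMS: returns indices of the kept boxes
keep = torchvision.ops.boxes.nms(boxes, scores, iou_threshold=0.6)

# Per-class NMS: boxes of different classes never suppress each other
keep_per_class = torchvision.ops.boxes.batched_nms(boxes, scores, classes, iou_threshold=0.6)
```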
Great work!
torchvision.ops implements operators that are specific to Computer Vision. Those operators currently do not support TorchScript. nms() performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).
AttributeError: module 'torchvision' has no attribute 'ops'
What should I do?
@omizonly what is your use case for TorchScript?
@omizonly I don't understand, can you elaborate? This repo only runs PyTorch and exports to ONNX for onward use in other formats; however, we clearly cannot support you with problems in those other formats. I suggest you raise an issue on the PyTorch or TF repos.
I'll close this issue for now as the original issue appears to have been resolved, and/or no activity has been seen for some time. Feel free to comment if this is not the case.
Quick update with latest code on one T4 GPU. Second line is current default.
python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp.cfg --img 608
| NMS method | Time per image | Time (mm:ss) | COCO mAP @0.5:0.95 | COCO mAP @0.5 |
|---|---|---|---|---|
| 'vision_batched', multi_cls=False | 43 ms | 3:36 | 40.2 | 60.4 |
| 'vision_batched', multi_cls=True | 48 ms | 4:01 | 40.9 | 61.4 |
| 'merge', multi_cls=True | 172 ms | 14:23 | 41.3 | 61.7 |
Is there a way to make the model print the JSON file if it detects an object regardless of classification?
Hi, I saw the Fast NMS proposed by YOLACT. What do you think of it? https://arxiv.org/abs/1912.06218
@Zzh-tju yes, that seems an interesting approach. They apply NMS as a matrix operation to remove the for loop, which they say runs much faster with a minimal mAP penalty.
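For reference, a minimal sketch of the Fast NMS idea as I read it from the paper (not the authors' code), assuming boxes are already sorted by descending score:

```python
import torch
from torchvision.ops import box_iou

def fast_nms(boxes, scores, iou_thres=0.5):
    # boxes: (N, 4) xyxy, sorted by descending score
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper-triangular IoU matrix
    # A box is kept only if no higher-scoring box overlaps it above the threshold.
    # Unlike standard NMS, a box can suppress others even if it is itself suppressed,
    # which is what turns the sequential loop into a single matrix operation.
    keep = iou.max(dim=0).values < iou_thres
    return boxes[keep], scores[keep]
```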
Depending on the conf-thres used, NMS may or may not be a very expensive operation in this repo. For most real-world applications with conf-thres around 0.1-0.9, NMS is not a speed concern, taking <10% of the total processing time for an image, but when calculating mAP near conf-thres = 0.0001, for example, NMS may take up 90% of the processing time.
If you can try implementing a Fast NMS experiment here, that would be very useful. The NMS function is here. In the meantime I will update this thread with the latest speeds on a T4 colab instance.
Lines 504 to 512 in dce753e
UPDATE: I've posted an issue on yolact repo for this dbolya/yolact#366 (comment)
Update: I discovered that a majority of the time in test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (and compute mAP only with repo code), I get the following times for the 5k COCO2014 val images. Machine is a 12-vCPU V100 instance.
python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
| NMS method | Time (ms/img) | Time (mm:ss) | mAP @0.5:0.95 | mAP @0.5 |
|---|---|---|---|---|
| 'vision_batched' (default) | 15.2 ms | 1:16 | 41.9 | 61.8 |
| 'merge' | 103 ms | 8:35 | 42.3 | 62.0 |
| 'fast_batched' | 14.6 ms | 1:13 | 41.5 | 61.5 |
@Zzh-tju to clear up the timing a bit more, I added profiling code to test.py that specifically tracks inference and NMS times in e482392. This can be accessed with the --profile flag:
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:
Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image
So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).
The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!
CORRECTION: My previous analysis was incorrect; it lacked the torch.cuda.synchronize() operations necessary when profiling CUDA operations. I've fixed this in 1430a1e. Corrected results, consistent across several runs:
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image
The conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than the default torchvision.ops.boxes.batched_nms().
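For anyone reproducing this, the key detail is that CUDA kernels launch asynchronously, so each timestamp needs an explicit synchronize first. A minimal sketch (the model and NMS calls are placeholders, not the repo's exact profiling code):

```python
import time
import torch

def time_synchronized():
    # Wait for all queued CUDA work to finish before reading the wall clock;
    # otherwise GPU time gets attributed to whichever later op happens to block.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

t0 = time_synchronized()
# pred = model(imgs)                     # inference (placeholder)
t1 = time_synchronized()
# out = non_max_suppression(pred)        # NMS (placeholder)
t2 = time_synchronized()
# t1 - t0 -> inference time, t2 - t1 -> NMS time
```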
Inference can be sped up with larger batch sizes, but NMS is run per image in all cases, so the only ways to affect its speed currently are the following (note that the 1.6 ms profile time uses all default settings, i.e. none of these speedups are applied):
- Increase your conf_thres
- Turn off multi_cls
- Decrease iou_thres
Line 504 in 1dc1761
Running a few tests to document effects on speed. These are with a V100 from a docker container, which is slightly slower than running natively.
python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608
rect=False
cudnn.deterministic=True, cudnn.benchmark = False:
12.9/1.8/14.8 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False:
9.9/1.7/11.6 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = True:
9.5/1.7/11.1 ms inference/NMS/total per 608x608 image at batch-size 32
rect=True
cudnn.deterministic=True, cudnn.benchmark = False:
9.8/1.7/11.5 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False: (default)
6.8/1.7/8.6 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = True:
18.2/1.7/19.9 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False, bs64
7.0/1.7/8.8 ms inference/NMS/total per 608x608 image at batch-size 64
cudnn.deterministic=False, cudnn.benchmark = False, bs1
14.0/2.0/16.0 ms inference/NMS/total per 608x608 image at batch-size 1
cudnn.deterministic=False, cudnn.benchmark = False, no contiguous() in models.py L207
6.8/1.7/8.5 ms inference/NMS/total per 608x608 image at batch-size 32
cudnn.deterministic=False, cudnn.benchmark = False, no contiguous(), reshape in models.py L207
6.8/1.7/8.5 ms inference/NMS/total per 608x608 image at batch-size 32
Running default natively:
Speed: 6.7/1.6/8.2 ms inference/NMS/total per 608x608 image at batch-size 32
no contiguous():
Speed: 6.6/1.6/8.2 ms inference/NMS/total per 608x608 image at batch-size 32
no contiguous() bs1:
Speed: 12.8/1.8/14.6 ms inference/NMS/total per 608x608 image at batch-size 1
yes contiguous() bs1:
Speed: 12.7/1.8/14.5 ms inference/NMS/total per 608x608 image at batch-size 1
no contiguous() bs1 img-size 512
Speed: 12.5/1.8/14.3 ms inference/NMS/total per 512x512 image at batch-size 1
no contiguous() bs1 img-size 416
Speed: 12.8/1.8/14.6 ms inference/NMS/total per 416x416 image at batch-size 1
no contiguous() bs1 img-size 608 yolov3-tiny
Speed: 3.2/1.8/4.9 ms inference/NMS/total per 608x608 image at batch-size 1
V100:
Speed: 6.6/1.5/8.1 ms inference/NMS/total per 608x608 image at batch-size 32
Speed: 17.2/1.5/18.8 ms inference/NMS/total per 800x800 image at batch-size 1
Speed: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1
Speed: 11.6/1.5/13.1 ms inference/NMS/total per 512x512 image at batch-size 1
Speed: 11.6/1.5/13.1 ms inference/NMS/total per 416x416 image at batch-size 1
Speed: 11.6/1.5/13.1 ms inference/NMS/total per 320x320 image at batch-size 1
2080Ti:
Speed: 9.2/1.2/10.4 ms inference/NMS/total per 608x608 image at batch-size 32
Speed: 13.9/1.5/15.4 ms inference/NMS/total per 608x608 image at batch-size 1
CPU:
Speed: 753.0/2.9/756.0 ms inference/NMS/total per 608x608 image at batch-size 1
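For reference, the cuDNN settings toggled in these runs are just the standard PyTorch backend flags, set before inference:

```python
import torch

# Settings swept in the benchmarks above (shown with their default values):
torch.backends.cudnn.deterministic = False  # True forces deterministic (often slower) kernels
torch.backends.cudnn.benchmark = False      # True lets cuDNN autotune kernels per input shape
```

Since cudnn.benchmark re-tunes whenever the input shape changes, it plausibly helps with rect=False (fixed shapes) and hurts with rect=True (variable shapes), which would be consistent with the numbers above.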
Does batch_size=32 mean testing 32 images simultaneously, including NMS?
Test-time augmentation study #931:
Default + 0 ops: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1
Default + 1 ops: 18.7/1.6/20.3 ms inference/NMS/total per 608x608 image at batch-size 1
Default + 2 ops: 26.4/1.8/28.2 ms inference/NMS/total per 608x608 image at batch-size 1
Updated V100 speeds with fused inference:
Speed: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1 NEW RECORD
Speed: 6.5/1.5/8.1 ms inference/NMS/total per 608x608 image at batch-size 32 NEW RECORD
Default + 0 ops: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1
Default + 2 ops: 26.1/1.9/28.1 ms inference/NMS/total per 608x608 image at batch-size 1
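Fused inference here means folding each BatchNorm2d into the preceding Conv2d at eval time, so the two layers become a single convolution with adjusted weights and bias. A minimal sketch of that standard fusion (function name is illustrative, not necessarily the repo's exact helper; groups and dilation are ignored for brevity):

```python
import torch
import torch.nn as nn

def fuse_conv_and_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Returns a Conv2d equivalent to bn(conv(x)), valid for inference only.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        # Fold the BN scale into the conv weights
        w = conv.weight.view(conv.out_channels, -1)
        scale = torch.diag(bn.weight / torch.sqrt(bn.running_var + bn.eps))
        fused.weight.copy_((scale @ w).view(fused.weight.shape))
        # Fold the BN shift into the conv bias
        b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_(scale @ b + bn.bias -
                         bn.weight * bn.running_mean / torch.sqrt(bn.running_var + bn.eps))
    return fused
```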
SOLOv2 Table 7: Matrix NMS:
https://arxiv.org/pdf/2003.10152.pdf
UPDATE: Unable to reproduce using this code:
elif method == 'matrix_batch':  # Matrix NMS from https://arxiv.org/abs/2003.10152
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
    m = iou.max(0)[0].view(-1, 1)  # max values
    decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
    scores *= decay
    i = torch.full((boxes.shape[0],), fill_value=1).bool()
> AttributeError: module 'torchvision' has no attribute 'ops'

Have you solved it? I ran into the same problem.
This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.
@glenn-jocher Hi, could you tell me why we cannot do NMS across a whole batch? Currently NMS is done on images one by one, even when batch testing is turned on.
The number of detections differs from image to image; is that the reason why we cannot perform real batch NMS?
@Zzh-tju feel free to play around with the NMS code and try your idea out. If you see performance improvements please submit a PR! Thank you.
@glenn-jocher I have just figured out a speed improvement and will give you a PR later. You can try it and optimize it further.
Torchvision NMS cannot run across images (if we add an image-related offset to the boxes, the IoU matrix grows quadratically), so I had to try Cluster-NMS. I keep the preprocessing of NMS unchanged and just replace the core part of your merge NMS with Cluster-Weighted NMS.
| | Batch size | torchvision merge NMS | batch-mode Cluster-Weighted NMS | Cluster-Weighted NMS |
|---|---|---|---|---|
| AP | - | 42.9 | 42.9 | 42.9 |
| time | 4 | 3.0 ms | 4.4 ms | 5.5 ms |
| time | 32 | 2.3 ms | 3.0 ms | 4.7 ms |
Now I want to ask you: why does NMS time decrease as batch size increases (for torchvision NMS)?
What's the max batch size we can use? I run on two 2080Ti GPUs, and batch size 32 takes about 6-7 GB of memory per GPU.
I guess if we continue to increase the batch size when testing, the batch-mode Cluster-NMS series may benefit more.
However, limited by my coding ability, the code can probably be optimized further.
I think the best approach may be to integrate the preprocessing of NMS into batch mode as well, even if it brings a slight performance drop. Right now preprocessing takes about 1.3-1.5 ms, versus just 0.8 ms for your torchvision merge NMS, so there is still room for acceleration.
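For readers who haven't seen it, a rough sketch of plain (unweighted) Cluster-NMS as I understand it from the paper, assuming boxes are pre-sorted by descending score; the first iteration is exactly Fast NMS, and iterating until the keep mask stops changing is reported to match standard sequential NMS:

```python
import torch
from torchvision.ops import box_iou

def cluster_nms(boxes, scores, iou_thres=0.5, max_iter=200):
    # boxes: (N, 4) xyxy, sorted by descending score
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper-triangular IoU matrix
    keep = torch.ones(boxes.shape[0], dtype=torch.bool, device=boxes.device)
    for _ in range(max_iter):
        # Only boxes that are currently kept are allowed to suppress others
        masked = iou * keep.float().unsqueeze(1)
        new_keep = masked.max(dim=0).values < iou_thres
        if torch.equal(new_keep, keep):
            break
        keep = new_keep
    return boxes[keep], scores[keep]
```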
@Zzh-tju ah! Thanks for the interesting study. We've actually discovered that in yolov5 the box regression has improved enough that we can stop using merge and simply use the default PyTorch NMS to get the same results, so the current NMS strategy in yolov5 is to not use merge anymore.
It is an interesting idea to do a batched NMS approach instead of calling the NMS function once per image. Your results show a significant improvement: 2.3 vs 3.0 ms is about 25% faster (!). This would make a huge improvement on yolov5s for example, which has an inference time of 2.1 ms per image at batch-size 32 FP16, about half of which is used up by NMS. See speeds here. NMS is about 1 ms per image in those numbers, so a 25% speedup there would be noticeable in the table.
https://github.com/ultralytics/yolov5#pretrained-checkpoints
Right now the boxes are offset by (class * max_image_size) to get batched per image (so different classes never overlap). I suppose that to run once per batch we would need the offset to be unique per (image, class) pair, e.g. ((image_index * num_classes + class) * max_image_size)? Are you using torchvision.ops.nms() or torchvision.ops.boxes.batched_nms()?
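For context, a minimal sketch of the coordinate-offset trick described above (the same idea batched_nms uses internally); max_wh is an assumed bound on the image size, not a repo constant:

```python
import torch
import torchvision

def nms_with_class_offsets(boxes, scores, class_idx, iou_thres=0.6):
    # Shift each box by a large class-dependent offset so boxes of different
    # classes can never overlap, then run a single class-agnostic NMS pass.
    max_wh = 4096  # assumed upper bound on any box coordinate
    offsets = class_idx.to(boxes) * max_wh
    keep = torchvision.ops.nms(boxes + offsets[:, None], scores, iou_thres)
    return keep
```

Extending the same trick to a whole batch would just require the offset to also be unique per image index.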
@glenn-jocher no, you misunderstand me. My question is: why does NMS speed also improve as batch size increases?
@Zzh-tju in my experiments with yolov5, NMS speed is the same no matter the batch size. For example from the notebook:
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 1
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 8
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 32
Output:
Namespace(augment=False, batch_size=1, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)
Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 8725.21it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100% 128/128 [00:03<00:00, 37.63it/s]
all 128 929 0.379 0.74 0.676 0.44
Speed: 9.3/1.8/11.1 ms inference/NMS/total per 640x640 image at batch-size 1
Namespace(augment=False, batch_size=8, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)
Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 5722.17it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100% 16/16 [00:02<00:00, 5.41it/s]
all 128 929 0.381 0.744 0.68 0.442
Speed: 4.1/2.2/6.3 ms inference/NMS/total per 640x640 image at batch-size 8
Namespace(augment=False, batch_size=32, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)
Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 9776.04it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100% 4/4 [00:04<00:00, 1.12s/it]
all 128 929 0.385 0.752 0.692 0.452
Speed: 4.2/2.1/6.3 ms inference/NMS/total per 640x640 image at batch-size 32
So 1.8ms, 2.2ms, 2.1ms at batch sizes 1, 8, 32. Basically NMS speed per image is not correlated to batch size.
Got it @glenn-jocher, I will do more tests with different batch sizes.
@glenn-jocher Hi, I have just finished a marginal piece of work on Batch-Mode Weighted Cluster-NMS for speeding up NMS. You can check https://github.com/Zzh-tju/yolov5 for details. My conclusion is that batch-mode Weighted Cluster-NMS will benefit us when TTA is used.
@Zzh-tju ah, very interesting! I'll check out the forked repo.
@Zzh-tju I looked things over. You've clearly done a lot of work and experimentation!
I see it's hard to provide substantial gains over basic NMS, unfortunately. I think this is because box regression is improving over past works, so perhaps the gains from merging two 0.90-IoU boxes are smaller than, for example, merging two 0.5-IoU boxes. It's unfortunate, because one of the yolov5 changes is actually increased grid sensitivity: in yolov3 only one cell per output layer could trigger on an object, while in yolov5 at least 3 cells per output layer always trigger per object (the nearest 3), so I'd expect many more boxes to be proposed by yolov5 than by yolov3. It's frustrating that there isn't a better way to exploit all these extra statistics.
One very interesting piece of information I found during the TTA and ensembling work: merging output grids always produced better results than appending output boxes together. If you look at the YOLOv5 ensembling module you will see that there are 3 options:
https://github.com/ultralytics/yolov5/blob/cab36f72a852ef00e8b42d3283ba9b2fc757b17f/models/experimental.py#L117-L129
- mean ensemble: takes the mean() of all output grids; e.g. the YOLOv5s small-object output grid and the YOLOv5m small-object output grid have the same shape, so this takes the mean() of the two grids. Best results.
- max ensemble: same as mean ensemble, but applies max(). Poor results.
- nms ensemble: appends all output boxes together for NMS to sort out. OK results.
If there were a way to mean() the TTA output grids the way mean ensemble works, it might produce the best results, but that is very complicated due to the varying output shapes, so unfortunately I abandoned this effort.
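A rough sketch of what those three options look like in code (simplified paraphrase of the linked module, not a verbatim copy); y is a list of same-shaped output tensors from the ensembled models:

```python
import torch

def combine_ensemble_outputs(y, mode='mean'):
    # y: list of tensors, one per model, each of identical shape
    if mode == 'mean':    # mean ensemble: average the raw output grids (best results)
        return torch.stack(y).mean(0)
    elif mode == 'max':   # max ensemble: element-wise max of the grids (poor results)
        return torch.stack(y).max(0).values
    else:                 # nms ensemble: append all boxes and let NMS sort them out (OK results)
        return torch.cat(y, 1)
```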
@glenn-jocher wait a second, why do TTA output grids have different output shapes?
@glenn-jocher And I did see an improvement when merging two 0.8-IoU boxes rather than two 0.65-IoU boxes.
@Zzh-tju ensemble output grids will have the same shape, for example if you run both YOLOv5s and YOLOv5m at the same image size, the 3 output grids from YOLOv5s are the same size as from YOLOv5m.
TTA uses different inference sizes as part of its augmentation, so naturally the output grids will change in size and can no longer be directly meaned.
Hmm, interesting, 0.8 IoU is higher than I've ever tried. I think the more accurate the box regressions, the higher you can raise the IoU threshold. What was the improvement you saw using 0.8 IoU?
@glenn-jocher see the results in https://github.com/Zzh-tju/yolov5; the weighted threshold is the merging threshold.
@glenn-jocher
Do you mean that when the input size changes, the size of the output grid maps changes too?
@Zzh-tju yes. YOLOv5 strides are 8, 16, 32 on the small, medium and large object output layers. So a 640x640 image will have 3 output grids of size 80x80, 40x40 and 20x20.
The same output grids for a 320x320 image are 40x40, 20x20 and 10x10.
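In other words, each output grid is just the input size divided by that layer's stride:

```python
img_size = 640
strides = (8, 16, 32)                    # small, medium, large object layers
print([img_size // s for s in strides])  # [80, 40, 20]
print([320 // s for s in strides])       # [40, 20, 10]
```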