TESTING/INFERENCE AUGMENTATION
glenn-jocher opened this issue · 28 comments
🚀 Feature
This issue covers test-time augmentation 07d2f0a, the practice of merging results from multiple augmented versions of the same image to obtain a higher mAP. We've implemented 2 types of augmentations: left-right flip and scale. The machine is a P100 Colab instance.
python3 test.py --img 608 --data coco2014.data --batch 16
augmentation | mAP@0.5:0.95 | mAP@0.5 | time (ms)
---|---|---|---
default (no augment) | 41.8 | 61.8 | 12.7
default + flip lr | 42.6 | 62.6 | 25.8
default + scale 1.2x | 39.5 | 60.4 | 25.7
default + scale 1.1x | 41.1 | 62.0 | 25.8
default + scale 0.9x | 43.2 | 63.0 | 25.7
default + scale 0.8x | 43.4 | 63.2 | 25.8
default + scale 0.7x | 43.6 | 63.3 | 25.8
default + scale 0.6x | 43.3 | 63.0 | 25.8
default + scale 0.5x | 43.0 | 62.7 | 25.6
default + scale 0.6x + scale 0.8x | 43.7 | 63.4 | 39.9
default + flip lr + scale 0.7x | 44.1 | 63.8 | 39.9
default + merge(flip lr + scale 0.7x) | 43.7 | 63.7 | 26.9
Current best:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.441
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.638
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.473
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.253
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.484
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.591
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.354
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.579
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.635
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.458
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.681
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.782
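For anyone wondering how the augmented passes get combined, here is a minimal sketch of the idea (not the repo's actual implementation; the xywh output layout, the `[0]` indexing of the model output, and the helper name are assumptions): run the detector on the original image, a left-right flipped copy, and a scaled copy, map each set of boxes back to the original image frame, and concatenate everything before a single NMS pass.

```python
import torch
import torch.nn.functional as F

def tta_inference(model, img):  # hypothetical helper, img: (1, 3, H, W)
    """Sketch of test-time augmentation: infer on original + flipped + scaled
    copies, de-augment the boxes, and concatenate before NMS."""
    width = img.shape[-1]
    preds = [model(img)[0]]                                  # original image, (1, N, 5 + nc), xywh assumed

    pred_f = model(torch.flip(img, dims=[3]))[0]             # left-right flipped copy
    pred_f[..., 0] = width - pred_f[..., 0]                  # un-flip the x-center
    preds.append(pred_f)

    scale = 0.7                                              # 0.7x, the best single op in the table above
    img_s = F.interpolate(img, scale_factor=scale, mode='bilinear', align_corners=False)
    pred_s = model(img_s)[0]
    pred_s[..., :4] /= scale                                 # map xywh back to the original resolution
    preds.append(pred_s)

    return torch.cat(preds, 1)                               # feed this to the usual NMS
```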
@glenn-jocher
Great improvement.
Great, is this technique used in training or only in testing?
@Lornatang So, is this technique some other kind of merge?
What does "merge" in the last line of the table mean?
@Lornatang @Zzh-tju yes these are great results! Since the inference speed is so fast compared to other object detection models, I think we can run several test time augmentations to increase our test mAP while still beating other models in single-image inference speed.
The merge was a really interesting idea I had to combine two operations (a 0.7 scale and left-right flip) into one added image, so that only two images need to be analyzed (the original and the augmented). This is a speed-accuracy compromise: it shows higher mAP than any other single operation, but not quite as high as adding each operation separately (for 3 inference images total). But it is much faster (28ms vs 40ms).
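To illustrate the idea, a rough sketch of that merged augmentation (hypothetical, with the same output-layout assumptions as the sketch above): scale and flip the image in one shot, then invert the two operations on the predicted boxes in the right order.

```python
import torch
import torch.nn.functional as F

def merged_augment_pass(model, img, scale=0.7):  # hypothetical helper
    """One extra image that is both 0.7x-scaled and left-right flipped,
    so only two forward passes are needed in total."""
    img_a = torch.flip(F.interpolate(img, scale_factor=scale, mode='bilinear',
                                     align_corners=False), dims=[3])
    pred = model(img_a)[0]                         # boxes in augmented-image pixels (xywh assumed)
    pred[..., 0] = img_a.shape[-1] - pred[..., 0]  # undo the flip inside the scaled frame first
    pred[..., :4] /= scale                         # then undo the scale
    return pred                                    # concatenate with the plain prediction before NMS
```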
Ah. So when testing, there are two images to test (one is the original image and the other is first flipped lr and then scaled by 0.7). I misunderstood that as merge NMS, so I doubted the speed. Now that I've mentioned this, I'm curious: what about using merge NMS with default + merge(flip lr + scale 0.7x)?
@Zzh-tju yes, merge NMS may help a bit, but it has a massive time penalty, maybe 100 ms per image, simply because it is written in Python/PyTorch and not C. With all that time we could run many more augmentations instead and probably do better.
What we really need is someone good at C/PyTorch to create a compiled version of merge NMS that would run nearly as fast as torchvision.ops.boxes.nms(); then it would be a much more appealing drop-in that would help nearly every PyTorch object detection model :)
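For reference, a vectorized sketch of what "merge NMS" means here (a rough approximation, not the repo's code; it leans on torchvision's nms and box_iou ops): keep whatever plain NMS keeps, but replace each kept box with the confidence-weighted average of every candidate that overlaps it above the threshold.

```python
import torch
from torchvision.ops import nms, box_iou

def merge_nms(boxes, scores, iou_thres=0.6):
    """Weighted-box merge on top of standard NMS.
    boxes: (N, 4) xyxy, scores: (N,)"""
    keep = nms(boxes, scores, iou_thres)                       # indices surviving plain NMS
    iou = box_iou(boxes[keep], boxes)                          # (K, N) overlap of survivors vs all candidates
    weights = (iou > iou_thres).float() * scores[None]         # score weights for overlapping candidates
    merged = torch.mm(weights, boxes) / weights.sum(1, keepdim=True)  # weighted-average coordinates
    return merged, scores[keep]
```

Doing the same thing with a Python loop over the kept boxes is much slower, which is presumably where the ~100 ms penalty mentioned above comes from.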
OK, I will try to work on it, but I just don't know if I can make it. Since Fast NMS is surprisingly a little slower than torchvision NMS, maybe the best way to accelerate other NMS variants (other than torchvision NMS) is to implement them in CUDA, because as far as I know, CUDA NMS can at least avoid calculating the IoU of a box with itself. Now I'd like to see merge NMS + default + merge(flip lr + scale 0.7x) if you can test it.
@Zzh-tju I found pytorch/vision#826 from 2019 that implemented C extensions for the NMS ops. This is where my expertise ends though; unfortunately I'm not good at C or CUDA code :(
@Zzh-tju I completed a direct comparison here on a Colab instance with a T4 GPU, which is slower than the P100 used earlier. Inference time stays the same, but NMS time increases by 50X (!), though we do get a new record 44.7 mAP too (!), which is +0.6 using merge NMS.
default + flip lr + scale 0.7x
Speed: 72.4/2.1/74.4 ms inference/NMS/total per 608x608 image at batch-size 8
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.441
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.638
default + flip lr + scale 0.7x + mergeNMS
Speed: 71.9/103.7/175.7 ms inference/NMS/total per 608x608 image at batch-size 8
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.447
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.640
@Zzh-tju I've updated merge NMS in eac07f9 into a vectorized operation now. It runs much faster, and also a little better. After this update, I get the following results for augmented testing. These are my best results ever achieved on COCO 2014 with yolov3-spp.cfg. You should be able to reproduce with the command below:
$ python3 test.py --data coco2014.data --img 608 --iou 0.6 --augment --batch 16
Speed: 20.2/2.6/22.7 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.447
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.641
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.485
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.271
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.492
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.583
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.357
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.587
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.652
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.488
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.701
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.787
Aha, you have already done this work while I've been writing a paper for a month. And your code is much simpler than mine; I split pred into pred[0], pred[1], pred[2], pred[3].
Also, I have noticed that on a single 2080Ti GPU, computing the IoU matrix costs 0.5 ms, but Fast NMS costs 0.6 ms (here I ignore the extra steps like sorting scores and multiplying by max_wh). And torchvision NMS (torchvision.ops.boxes.batched_nms) costs less than 0.5 ms, which means it is faster than computing the IoU matrix alone.
As far as I know, the core of the CUDA NMS algorithm is a bitwise OR operation (if I understand correctly). I also implemented the bitwise OR method in PyTorch right after computing the IoU matrix, and surprisingly it takes more than 22 ms (timed from after the IoU matrix is obtained until the output vector i
is obtained.) So torchvision NMS must accelerate the whole process in some way I don't know; perhaps it just benefits from the CUDA implementation.
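For concreteness, the kind of pure-PyTorch suppression loop being timed here might look roughly like this (a sketch, not the exact code being discussed; it assumes the pairwise IoU matrix is already computed):

```python
import torch

def greedy_nms_from_iou(iou, scores, iou_thres=0.5):
    """Greedy NMS driven by a precomputed (N, N) IoU matrix: walk down the score
    ranking and OR every overlap of a kept box into a suppression mask."""
    order = scores.argsort(descending=True)
    iou = iou[order][:, order]
    suppressed = torch.zeros(len(order), dtype=torch.bool, device=iou.device)
    keep = []
    for i in range(len(order)):
        if suppressed[i]:                     # this GPU->CPU sync every iteration is a big part of the cost
            continue
        keep.append(order[i].item())
        suppressed |= iou[i] > iou_thres      # bitwise-OR style suppression of everything this box covers
    return torch.tensor(keep, device=iou.device)
```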
About Matrix NMS in SOLOv2: I have tested it on YOLACT but get AP=0 and AR=0. Am I missing something or implementing it the wrong way?
def cc_matrix_nms(self, boxes, masks, scores, iou_threshold: float = 0.5, top_k: int = 200):
    # iou_threshold is unused here; Matrix NMS decays scores instead of hard-thresholding
    # Collapse all the classes into 1
    scores, classes = scores.max(dim=0)
    _, idx = scores.sort(0, descending=True)
    idx = idx[:top_k]

    masks_idx = masks[idx]
    classes_idx = classes[idx]
    scores_idx = scores[idx]
    boxes_idx = boxes[idx]

    # Pairwise IoU, upper triangle only: row i vs. every lower-scoring box j
    iou = jaccard(boxes_idx, boxes_idx).triu_(diagonal=1)
    iou_max = iou.max(dim=0)[0].unsqueeze(1).expand_as(iou)

    # Matrix NMS (SOLOv2) Gaussian decay with sigma = 0.5; larger overlap -> smaller decayed score
    decay = torch.exp(-(iou ** 2 - iou_max ** 2) / 0.5).min(dim=0)[0]
    scores_idx = scores_idx * decay

    # Re-rank by the decayed scores and keep the top detections
    scores_idx, idx_out = scores_idx.sort(0, descending=True)
    idx_out = idx_out[:cfg.max_num_detections]
    return boxes_idx[idx_out], masks_idx[idx_out], classes_idx[idx_out], scores_idx[:cfg.max_num_detections]
And cross class Fast NMS is
def cc_fast_nms(self, boxes, masks, scores, iou_threshold: float = 0.5, top_k: int = 200):
    # Collapse all the classes into 1
    scores, classes = scores.max(dim=0)
    _, idx = scores.sort(0, descending=True)
    idx = idx[:top_k]

    boxes_idx = boxes[idx]

    # Compute the pairwise IoU between the boxes
    iou = jaccard(boxes_idx, boxes_idx)

    # Zero out the diagonal and the lower triangle of the IoU matrix
    iou.triu_(diagonal=1)

    # Now that everything on the diagonal and below is zeroed out, taking the max of the
    # IoU matrix along the columns gives, for each box, its maximum IoU with any
    # higher-scoring box.
    iou_max, _ = torch.max(iou, dim=0)

    # Now just filter out the ones greater than the threshold, i.e. only keep boxes that
    # don't have a higher-scoring box that would suppress them in normal NMS.
    idx_out = idx[iou_max <= iou_threshold]

    return boxes[idx_out], masks[idx_out], classes[idx_out], scores[idx_out]
Update: after fixing the code mistakes, Matrix NMS on YOLACT with sigma=0.1 gets the best result (among sigma=0.05, 0.1, 0.2, 0.3, 0.5): box AP=0.248, while traditional NMS gets 0.327.
Notice that if IoU=0, decay>1. Is that OK? I'm not sure about it.
@Zzh-tju Matrix NMS gave you 0.248 mAP vs 0.327 normal NMS? That sounds pretty bad...
Decay should range between 0 and 1 I believe.
Yeah, I just output the top 100 detections. The main thing I'm curious about is the decay: should it be < 1? Otherwise, you know, the IoU matrix has many zero elements.
What's your result on YOLOv3?
OK, decay > 1 is fine, since it takes the column-wise minimum. With larger IoU, the score gets lower.
@Zzh-tju But to get a decay greater than 1, you would need IoUs outside the 0-1 range, no?
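For what it's worth, decay > 1 can already happen with IoUs inside [0, 1] under the Gaussian decay used above: a box whose overlap with every higher-scoring box is smaller than that box's own maximum overlap gets its score boosted. A toy check (assuming sigma = 0.5, the /0.5 divisor in the code above):

```python
import torch

# decay term for one pair: exp(-(iou_ij**2 - iou_max_i**2) / sigma)
sigma = 0.5
iou_ij, iou_max_i = torch.tensor(0.0), torch.tensor(0.6)   # box j clear of box i; box i itself heavily overlapped
decay_term = torch.exp(-(iou_ij ** 2 - iou_max_i ** 2) / sigma)
print(decay_term)  # ~2.05, i.e. > 1 even though both IoUs are inside [0, 1]
```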
I have noticed that you updated the result to YOLOv3-SPP-ultralytics 608 AP 43.1. What is the difference between it and the previous 41.9? I may want to use it.
@Zzh-tju Various training updates; yolov3-spp.cfg remains the same. The repo will automatically download the latest model, so for example if you delete your existing yolov3-spp-ultralytics.pt and run test.py, the latest will be downloaded and used.
@glenn-jocher I tested the new weight file with the command
python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608
and get 42.8 AP. What am I missing?
@Zzh-tju Use --iou 0.7 for the highest mAP@0.5:0.95 and --iou 0.5 for the highest mAP@0.5. The default is 0.6, in the middle.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.