ultralytics/yolov3

TESTING/INFERENCE AUGMENTATION

Closed this issue · 28 comments

🚀 Feature

This issue covers test-time augmentation 07d2f0a, the practice of merging results from multiple augmented versions of the same image to obtain a higher mAP. We've implemented 2 types of augmentation: left-right flip and scale. The machine is a P100 Colab instance.

python3 test.py --img 608 --data coco2014.data --batch 16
augmentation                           mAP@0.5:0.95  mAP@0.5  time (ms)
default (no augment)                   41.8          61.8     12.7
default + flip lr                      42.6          62.6     25.8
default + scale 1.2x                   39.5          60.4     25.7
default + scale 1.1x                   41.1          62.0     25.8
default + scale 0.9x                   43.2          63.0     25.7
default + scale 0.8x                   43.4          63.2     25.8
default + scale 0.7x                   43.6          63.3     25.8
default + scale 0.6x                   43.3          63.0     25.8
default + scale 0.5x                   43.0          62.7     25.6
default + scale 0.6x + scale 0.8x      43.7          63.4     39.9
default + flip lr + scale 0.7x         44.1          63.8     39.9
default + merge(flip lr + scale 0.7x)  43.7          63.7     26.9
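
For context, the augment-and-merge flow looks roughly like the sketch below. This is a minimal illustration, not the repo's exact implementation: it assumes a hypothetical model(img) that returns an Nx6 tensor of [x1, y1, x2, y2, conf, cls] in pixel coordinates, and NMS is applied afterwards to the concatenated detections.

    import torch
    import torch.nn.functional as F

    def tta_inference(model, img, scales=(0.7,), flip=True):
        # img: 1xCxHxW tensor; model(img): Nx6 [x1, y1, x2, y2, conf, cls] (assumed)
        h, w = img.shape[-2:]
        preds = [model(img)]  # default (no augment)

        if flip:
            p = model(torch.flip(img, dims=[3]))  # left-right flip
            p[:, [0, 2]] = w - p[:, [2, 0]]       # un-flip x coordinates
            preds.append(p)

        for s in scales:
            p = model(F.interpolate(img, scale_factor=s, mode='bilinear', align_corners=False))
            p[:, :4] /= s                         # map boxes back to the original scale
            preds.append(p)

        return torch.cat(preds, 0)                # run NMS on this merged set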

Current best:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.441
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.638
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.253
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.484
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.591
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.354
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.579
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.635
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.458
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.782

@glenn-jocher
Great improvement.

Great. Is this technique used in training, or only in testing?

@Zzh-tju
Currently it is only used at test time.

@Lornatang So this technique is some other kind of merge?

@Zzh-tju
Currently it is used during evaluation to find the settings that give optimal performance.

What does the last line of the table, "merge", mean?

@Lornatang @Zzh-tju yes these are great results! Since the inference speed is so fast compared to other object detection models, I think we can run several test time augmentations to increase our test mAP while still beating other models in single-image inference speed.

The merge was a really interesting idea I had: combine two operations (a 0.7x scale and a left-right flip) into one added image, so that only two images need to be analyzed (the original and the augmented). This is a speed-accuracy compromise: it shows a higher mAP than any other single operation, but not quite as high as adding each operation separately (for 3 inference images total). It is much faster, though (about 27 ms vs 40 ms).
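
In code, the merged augmentation amounts to something like this (a sketch, with the same hypothetical Nx6 output assumption as above): flip and scale are composed into a single extra image, and the inverse transforms are applied to its boxes in reverse order.

    import torch
    import torch.nn.functional as F

    def merged_augment(model, img, s=0.7):
        # Fold flip-lr and the s-scale into ONE extra image: two forward passes total
        h, w = img.shape[-2:]
        img_aug = F.interpolate(torch.flip(img, dims=[3]), scale_factor=s,
                                mode='bilinear', align_corners=False)
        p = model(img_aug)
        p[:, :4] /= s                        # undo the scale
        p[:, [0, 2]] = w - p[:, [2, 0]]      # then undo the flip
        return torch.cat((model(img), p), 0)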

Ah. So when testing there are two images to test (one is the original image and the other is first flipped left-right and then scaled by 0.7x). I misunderstood it as merge NMS, which is why I doubted the speed. Now that I've mentioned this, I'm curious: what about using merge NMS on top of default + merge(flip lr + scale 0.7x)?

@Zzh-tju yes mergeNMS may help a bit, but it has a massive time penalty, maybe 100 ms per image, simply because it is written in python/pytorch and not C. With all that time we could run many more augmentations instead and probably do better.

What we really need is someone good at C/pytorch to create a compiled version of merge NMS that would run nearly as fast as torchvision.ops.boxes.nms(), then it would be a much more appealing drop-in that would help nearly every pytorch object detection model :)

OK, I will try to work on it, but I don't know if I can manage it. Since Fast NMS is surprisingly only a tiny bit slower than torchvision NMS, maybe the best way to accelerate the other NMS variants (besides torchvision NMS) is to implement them in CUDA, because as far as I know CUDA NMS can at least avoid calculating the IoU of a box with itself. For now I'd like to see merge NMS + default + merge(flip lr + scale 0.7x) if you can test it.

@Zzh-tju I found pytorch/vision#826 from 2019 that implemented C extensions for the nms ops. This is where my expertise ends though, unfortunately I'm not good at C or CUDA code :(

@Zzh-tju I completed a direct comparison here on a Colab instance with a T4 GPU, which is slower than the P100 used earlier. Inference time stays the same, but NMS time increases by 50X (!), though we do get a new record of 44.7 mAP too (!), which is +0.6 from merge NMS.

default + flip lr + scale 0.7x

Speed: 72.4/2.1/74.4 ms inference/NMS/total per 608x608 image at batch-size 8
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.441
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.638

default + flip lr + scale 0.7x + mergeNMS

Speed: 71.9/103.7/175.7 ms inference/NMS/total per 608x608 image at batch-size 8
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.447
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.640

@Zzh-tju I've updated merge NMS in eac07f9 into a vectorized operation now. It runs much faster, and also a little better. After this update, I get the following results for augmented testing. These are my best results ever achieved on COCO 2014 with yolov3-spp.cfg. You should be able to reproduce with the command below:

$ python3 test.py --data coco2014.data --img 608 --iou 0.6 --augment --batch 16

Speed: 20.2/2.6/22.7 ms inference/NMS/total per 608x608 image at batch-size 16
 
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.447
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.641
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.485
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.271
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.492
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.583
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.652
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.488
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.701
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.787
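
For anyone curious, the core of a vectorized merge NMS looks roughly like the sketch below. This is an illustration rather than the exact eac07f9 code: standard NMS picks the survivors, then each survivor's box is replaced by the confidence-weighted average of every box overlapping it above the IoU threshold (in practice you would run this per class, or offset the boxes by class first).

    import torch
    import torchvision

    def merge_nms(boxes, scores, iou_thres=0.6):
        # boxes: Nx4 xyxy, scores: N
        i = torchvision.ops.nms(boxes, scores, iou_thres)           # NMS survivors
        iou = torchvision.ops.box_iou(boxes[i], boxes) > iou_thres  # MxN overlap mask
        weights = iou * scores[None]                                # score-weighted membership
        merged = torch.mm(weights, boxes) / weights.sum(1, keepdim=True)
        return merged, scores[i]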

Aha, you have already done this work, while I've been writing a paper for the past month. And your code is much simpler than mine, where I split pred into pred[0], pred[1], pred[2], pred[3].

Also, I have noticed that on a single 2080Ti GPU, computing the IoU matrix costs 0.5 ms, while Fast NMS costs 0.6 ms (here I ignore the extra steps, like sorting scores and multiplying by max_wh). And torchvision NMS

torchvision.ops.boxes.batched_nms

costs less than 0.5 ms, which means it is faster than computing the IoU matrix alone.
As far as I know, the core of the CUDA NMS algorithm is a BITWISE OR operation (if I understand correctly). I also implemented the BITWISE OR method in PyTorch, applied right after computing the IoU matrix, and surprisingly it takes more than 22 ms (timed from when the IoU matrix is obtained until the output vector i is obtained). So torchvision NMS must accelerate the whole process in some way I don't know about; perhaps it simply benefits from its CUDA implementation.

About Matrix NMS from SOLOv2: I have tested it on YOLACT but get AP=0 and AR=0. Am I missing something or implementing it incorrectly?

    def cc_matrix_nms(self, boxes, masks, scores, iou_threshold:float=0.5, top_k:int=200):
        # Collapse all the classes into 1
        scores, classes = scores.max(dim=0)

        _, idx = scores.sort(0, descending=True)
        idx = idx[:top_k]

        masks_idx = masks[idx]
        classes_idx = classes[idx]
        scores_idx = scores[idx]
        boxes_idx = boxes[idx]

        # Pairwise IoU, upper triangle only (each box vs. every higher-scoring box)
        iou = jaccard(boxes_idx, boxes_idx).triu_(diagonal=1)
        iou_max = iou.max(dim=0)[0].unsqueeze(1).expand_as(iou)

        # Gaussian decay (sigma=0.5): soft-suppress scores rather than removing boxes
        decay = torch.exp(-(iou**2 - iou_max**2) / 0.5).min(dim=0)[0]
        scores_idx = decay * scores_idx

        # Keep the highest decayed scores
        _, idx = scores_idx.sort(0, descending=True)
        idx = idx[:cfg.max_num_detections]
        return boxes_idx[idx], masks_idx[idx], classes_idx[idx], scores_idx[idx]

And cross-class Fast NMS is:

    def cc_fast_nms(self, boxes, masks, scores, iou_threshold:float=0.5, top_k:int=200):
        # Collapse all the classes into 1 
        scores, classes = scores.max(dim=0)

        _, idx = scores.sort(0, descending=True)
        idx = idx[:top_k]

        boxes_idx = boxes[idx]

        # Compute the pairwise IoU between the boxes
        iou = jaccard(boxes_idx, boxes_idx)
        
        # Zero out the diagonal and lower triangle of the IoU matrix
        iou.triu_(diagonal=1)

        # Now that everything in the diagonal and below is zeroed out, if we take the max
        # of the IoU matrix along the columns, each column will represent the maximum IoU
        # between this element and every element with a higher score than this element.
        iou_max, _ = torch.max(iou, dim=0)

        # Now just filter out the ones greater than the threshold, i.e., only keep boxes that
        # don't have a higher-scoring box that would suppress them in normal NMS.
        idx_out = idx[iou_max <= iou_threshold]
        
        return boxes[idx_out], masks[idx_out], classes[idx_out], scores[idx_out]

Update: after fixing my code mistakes, Matrix NMS on YOLACT with sigma=0.1 gets the best result (among sigma=0.05, 0.1, 0.2, 0.3, 0.5): box AP=0.248, while traditional NMS gets 0.327.

Notice that if IoU=0, then decay>1. Is that OK? I'm not sure about it.

@Zzh-tju Matrix NMS gave you 0.248 mAP vs 0.327 for normal NMS? That sounds pretty bad...

Decay should range between 0 and 1 I believe.

Yeah, I just output the top 100 detections. The main thing I'm curious about is the decay: should it be < 1? Otherwise, you know, the IoU matrix has many zero elements.

What's your result on YOLOv3?

OK, decay > 1 is fine, since it takes the column-wise minimum: the larger the IoU, the lower the score.
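
A quick numeric check (sigma = 0.5, made-up values): zero-IoU entries do give a factor greater than 1, but the column-wise minimum ignores them whenever any real overlap exists.

    import torch

    sigma = 0.5
    iou = torch.tensor([0.0, 0.8])       # one candidate vs. two higher-scoring boxes
    iou_max = torch.tensor([0.3, 0.3])   # those boxes' own max IoU (compensation term)

    decay = torch.exp(-(iou**2 - iou_max**2) / sigma)
    # decay = [exp(0.09/0.5), exp(-0.55/0.5)] ~= [1.197, 0.333]
    print(decay.min())                   # 0.333 -> the >1 entry never wins the min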

@Zzh-tju but to get positive decay, you would need IoUs outside the 0-1 range, no?

I have noticed that you updated the results: YOLOv3-SPP-ultralytics at 608 now gets AP 43.1. What is the difference between it and the previous 41.9? I may want to use it.

@Zzh-tju various training updates; yolov3-spp.cfg remains the same. The repo will automatically download the latest model, so for example if you delete your existing yolov3-spp-ultralytics.pt and run test.py, the latest will be downloaded and used.

@glenn-jocher I tested the new weight file with the command

python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608

and get 42.8 AP. What am I missing?

@Zzh-tju --iou 0.7 for highest mAP@0.5:0.95, --iou 0.5 for highest mAP@0.5. Default is 0.6, in the middle.
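
For example, to target the highest mAP@0.5:0.95 with the command above:

python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608 --iou 0.7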

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
