Oriented bounding box detector built by modifying FCOS. This work is built on top of MMDetection.
To run the code, refer to the MMDetection instructions.
My library versions and configuration were:
mmcv 0.2.16
mmdet 1.0rc0+90c4798
torch 1.1.0
torchvision 0.2.1
Cython 0.29.15
numpy 1.17.4
gcc 4.9.4
CUDA 9.0
Evaluate the performance of PolarMask on the DOTA (Dataset for Object Detection in Aerial Images) dataset and compare it with existing state-of-the-art models. Extend the research by incorporating ideas from existing models into PolarMask to improve the segmentation.
RoI Transformer is a two-stage detector that ranks 2nd in the oriented object detection task and 4th in the horizontal detection task on the DOTA satellite image dataset. Its training process involves a supervised RRoI learner, a module that transforms horizontal RoIs into rotated RoIs. This lets the model use fewer anchor boxes, unlike models that rely on a large number of rotated anchors to achieve the same output. Rotated RoI warping then uses the learned orientation to make the extracted features rotation-invariant.
SCRDet is a two-stage detector that ranks 8th in the oriented object detection task and 6th in the horizontal detection task on the DOTA satellite image dataset. The model consists of three main parts: SF-Net, MDA-Net and a rotation branch. SF-Net fuses the C3 and C4 layers of ResNet to balance semantic information and location information while ignoring less relevant features. MDA-Net (multi-dimensional attention network) enhances object cues and weakens non-object information through a branch that learns a saliency map of the input; an attention loss is calculated in this process. The rotation branch uses rotated non-maximum suppression as post-processing, based on the computation of skew IoU.
R3Det++ (Refined Rotation RetinaNet) is an anchor-based single-stage rotation detector that ranks 7th in the oriented object detection task on the DOTA satellite image dataset. R3Det adds a refinement stage for not only the detected box but also the feature map. The feature map refinement applies 5x1, 1x5 and 1x1 convolutions to support refinement of the detected box. This can be seen as a measure for handling large aspect ratios: a rotated single-stage detector is applied in a refined manner. Because rotated anchors achieve a higher recall than horizontal anchors in dense scenes, the model adopts a strategy that combines the two types of anchors for refinement.
DOTA, a satellite image dataset, contains 2806 images whose annotations are given as the 4 corner points of each oriented bounding box. It has 15 categories, including plane, ship and basketball court. These 2806 images contain over 180K instances, so each image holds a large number of instances. Despite this complexity, state-of-the-art models on the dataset achieve over 0.8 mAP (RoI Transformer + additional).
The choice between horizontal and oriented bounding boxes can be crucial in some applications. For example, because of its elongated shape and the nature of aerial imagery, a marine vessel cannot be located accurately with a horizontal object detector. This is because vessels vary widely not only in scale but also in orientation. After the RPN outputs region proposals, non-maximum suppression (NMS) is generally applied as an important step to reduce the number of candidates and increase detection efficiency. For oriented objects, however, overlapping horizontal bounding boxes make it difficult to distinguish vessels crowded near a port, and applying NMS in such crowded areas can result in missed targets.
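The crowded-port problem can be illustrated with a small sketch. It is not part of this repository: the helper names, the `(cx, cy, w, h, theta)` box format and the ship dimensions are all assumptions chosen for illustration, and `parallel_obb_iou` only handles the special case of two boxes that share the same angle (a general skew IoU needs polygon clipping).

```python
import math

def obb_corners(cx, cy, w, h, theta):
    """Corners of a box centred at (cx, cy) with size (w, h), rotated by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]

def enclosing_hbb(corners):
    """Smallest horizontal box (x1, y1, x2, y2) that contains all corners."""
    xs, ys = zip(*corners)
    return min(xs), min(ys), max(xs), max(ys)

def hbb_iou(a, b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def parallel_obb_iou(box_a, box_b):
    """IoU of two oriented boxes that share the same angle (special case only)."""
    ax, ay, aw, ah, theta_a = box_a
    bx, by, bw, bh, theta_b = box_b
    assert math.isclose(theta_a, theta_b), "sketch only supports parallel boxes"
    c, s = math.cos(theta_a), math.sin(theta_a)
    # Rotate both centres into the shared box frame, where the boxes are axis-aligned.
    au, av = ax * c + ay * s, -ax * s + ay * c
    bu, bv = bx * c + by * s, -bx * s + by * c
    a = (au - aw / 2, av - ah / 2, au + aw / 2, av + ah / 2)
    b = (bu - bw / 2, bv - bh / 2, bu + bw / 2, bv + bh / 2)
    return hbb_iou(a, b)

# Two hypothetical 20 x 2 ships moored side by side at 45 degrees, centres 2.5 apart.
theta = math.pi / 4
gap = 2.5
ship_a = (0.0, 0.0, 20.0, 2.0, theta)
ship_b = (-gap * math.sin(theta), gap * math.cos(theta), 20.0, 2.0, theta)

horizontal_iou = hbb_iou(enclosing_hbb(obb_corners(*ship_a)),
                         enclosing_hbb(obb_corners(*ship_b)))
oriented_iou = parallel_obb_iou(ship_a, ship_b)
print(f"horizontal IoU = {horizontal_iou:.3f}, oriented IoU = {oriented_iou:.3f}")
```

In this toy case the two ships do not touch, yet their horizontal boxes overlap by more than a typical NMS threshold of 0.5, so horizontal NMS would suppress one of them.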
The modification I have made is to change the length of the regression vector from 4 to 5; that is, to append an angle target as well.
The model outputs therefore have the shapes below.
There are 5 feature map levels (due to the FPN structure), and (w_i, h_i) is the spatial dimension of feature level i.
Class score = List(feature level, N, 15, w_i, h_i)
Bbox pred = List(feature level, N, 5, w_i, h_i)
Centerness = List(feature level, N, w_i, h_i)
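As a rough sketch of what these shapes look like in practice, the snippet below enumerates them per level. The function name is hypothetical, the strides (8, 16, 32, 64, 128) are the usual FCOS defaults and are assumed here rather than taken from this repository's config, and the 1024 x 1024 input size is only an example.

```python
import math

def fcos_obb_output_shapes(batch_size, img_w, img_h, num_classes=15,
                           strides=(8, 16, 32, 64, 128)):
    """Return the per-level output shapes, following the layout listed above.

    strides: assumed FPN strides (FCOS defaults), one per feature level.
    """
    shapes = []
    for stride in strides:
        # Each level downsamples the input by its stride.
        w_i, h_i = math.ceil(img_w / stride), math.ceil(img_h / stride)
        shapes.append({
            "cls_score": (batch_size, num_classes, w_i, h_i),
            "bbox_pred": (batch_size, 5, w_i, h_i),  # 4 distances + 1 angle
            "centerness": (batch_size, w_i, h_i),
        })
    return shapes

shapes = fcos_obb_output_shapes(batch_size=2, img_w=1024, img_h=1024)
for level, s in enumerate(shapes):
    print(level, s)
```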
To obtain the ground-truth values of the 4 distances and the angle for training, we use the following method:
This is the contour of the bounding box, with color representing the angle with respect to the subject pixel position. Moving clockwise along the contour, the angle increases from 0 to 360 degrees; we then use the GT angle of the bbox to find the four rays.
Here the red dot is the subject point, the green dots are the 4 points that let us infer the perpendicular distances, and the purple dot is the original center of the oriented box.
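The same four perpendicular distances can also be obtained in closed form by rotating the subject point into the box frame. The sketch below is an alternative to the contour-based procedure described above, not the code used in this repository; the function names, the `(cx, cy, w, h, theta)` parameterization and the `(left, top, right, bottom, angle)` ordering are assumptions. The centerness formula is the standard FCOS one.

```python
import math

def obb_regression_target(px, py, cx, cy, w, h, theta):
    """Distances from point (px, py) to the four sides of an oriented box, plus its angle.

    The box has centre (cx, cy), size (w, h) and rotation theta in radians.
    A point lies inside the box only if all four distances are positive.
    """
    dx, dy = px - cx, py - cy
    c, s = math.cos(theta), math.sin(theta)
    # Coordinates of the point in the box's own (rotated) frame.
    u = dx * c + dy * s
    v = -dx * s + dy * c
    left, right = u + w / 2, w / 2 - u
    top, bottom = v + h / 2, h / 2 - v
    return left, top, right, bottom, theta

def centerness(left, top, right, bottom):
    """FCOS centerness, computed here from the rotated-frame distances."""
    return math.sqrt((min(left, right) / max(left, right))
                     * (min(top, bottom) / max(top, bottom)))

# Axis-aligned 4 x 2 box at the origin, subject point at (1.0, 0.5).
target = obb_regression_target(1.0, 0.5, 0.0, 0.0, 4.0, 2.0, 0.0)
score = centerness(*target[:4])
print(target, round(score, 4))
```

A point at the box center gets centerness 1, and the value decays toward 0 as the point approaches any side.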
Reproduce the results of the baseline, R3Det++, FCOS and RoI Transformer on the DOTA dataset
Obtain results of plain PolarMask on DOTA for comparison (instance segmentation)
Error analysis of plain PolarMask on DOTA (instance segmentation)
Modify the modules of PolarMask for oriented bounding box prediction
Error analysis and improvements
Model | mAP |
---|---|
RetinaNet (Baseline) | 0.51 |
R3Det | 0.66 |
RoI Transformer | 0.72 |
FCOS-OBB (Mine) | 0.59 |

Class | AP |
---|---|
plane | 0.8644827688468113 |
baseball-diamond | 0.6303450416768105 |
bridge | 0.2836234662432951 |
ground-track-field | 0.4882575648855269 |
small-vehicle | 0.6472025055646514 |
large-vehicle | 0.5840595934115256 |
ship | 0.6647569189809845 |
tennis-court | 0.9068834035721455 |
basketball-court | 0.6929957009396605 |
storage-tank | 0.7713682587649533 |
soccer-ball-field | 0.44913695971245626 |
roundabout | 0.5449657166644799 |
harbor | 0.5146133776276023 |
swimming-pool | 0.3366705928715302 |
helicopter | 0.3861791785489448 |
As mentioned above, the major challenges from this point are handling small objects in cluttered scenes and fixing the problem of angle regression. Based on the models discussed earlier, possible improvements are:
1. Refinement of predictions based on the ideas of R3Det++
Refinement has been proven effective by several refined object detectors (Cascade R-CNN, RefineDet).
2. Pixel attention branch on the FPN neck's output feature maps, based on the ideas of SCRDet
This has been shown to be especially effective for images with cluttered instances.
3. Weighted regression loss
Similar to focal loss, we aim to put more emphasis on the angle regression because it is a crucial factor in predicting oriented boxes; a small change in angle leads to a large drop in skew IoU.
Adding a weight to the smooth L1 loss to regularize the angle prediction can therefore be an effective way to improve performance.
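A minimal sketch of the weighted regression loss in item 3, written in plain Python for a single prediction. This is an illustration under assumptions rather than the implemented loss: the function names, the `angle_weight` value of 2.0 and the `(left, top, right, bottom, angle)` ordering are all hypothetical.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1 (Huber-style) penalty for a single residual."""
    diff = abs(diff)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta

def weighted_obb_loss(pred, target, angle_weight=2.0):
    """Smooth L1 over the 5-vector, with extra weight on the angle term.

    pred, target: (left, top, right, bottom, angle), assumed ordering.
    angle_weight: hypothetical hyperparameter; 1.0 recovers the unweighted loss.
    """
    assert len(pred) == len(target) == 5
    weights = (1.0, 1.0, 1.0, 1.0, angle_weight)
    return sum(w * smooth_l1(p - t) for w, p, t in zip(weights, pred, target))

# Distances are predicted perfectly; only the angle is off by 0.2 rad.
pred = (3.0, 1.5, 1.0, 0.5, 0.3)
target = (3.0, 1.5, 1.0, 0.5, 0.1)
unweighted = weighted_obb_loss(pred, target, angle_weight=1.0)
weighted = weighted_obb_loss(pred, target, angle_weight=2.0)
print(unweighted, weighted)
```

With the angle weight doubled, the same angular error contributes twice as much to the loss, which pushes the optimizer to prioritize orientation. A suitable weight would need to be tuned on the validation set.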