zplizzi/tensorflow-fast-rcnn

Question about ROI scaling

zplizzi opened this issue · 10 comments

Copied from this GitHub issue thread.

@zplizzi I have a question about ROI processing. The input_rois are the original ROIs (from the XML annotations) divided by the scale factor (16), aren't they?
Do we also need to resize the ROIs? Because in the network we resize the image (original image -> 224x224 input image for VGG).

ck196 commented

Thank you for moving the question here.

You're correct, the ROI pooling layer should take as input scaled ROIs (so, in the case of an AlexNet-style architecture, scaled down by a factor of 16 from the original 224x224 input).

If the images are scaled from the original size to 224x224, the bounding boxes should be scaled at that step also. I'm actually not sure that I handled that properly in my VOC_import script - I'll have to double check and fix that if I didn't.
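To make the two scaling steps concrete, here is a minimal numpy sketch (my own illustration, not code from this repo; the function name and the (height, width) argument convention are assumptions): the boxes first follow the image resize, then get divided by the network's downsampling factor.

```python
import numpy as np

def scale_rois(rois, orig_size, input_size=(224, 224), downsample=16):
    """Hypothetical helper: rois is an (N, 4) array of [x1, y1, x2, y2]
    boxes in original-image coordinates; sizes are (height, width)."""
    rois = np.asarray(rois, dtype=np.float32)
    sy = input_size[0] / orig_size[0]  # height scale (original -> network input)
    sx = input_size[1] / orig_size[1]  # width scale
    rois = rois * np.array([sx, sy, sx, sy])  # boxes follow the image resize
    return rois / downsample  # feature-map coordinates for ROI pooling

# e.g. a full-width box in a 500x375 VOC image maps to ~14x14 feature cells
print(scale_rois([[0, 0, 500, 375]], orig_size=(375, 500)))
```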

ck196 commented

Thank you very much.

ck196 commented

I have one more problem.
When I scale the ROIs down, some of them end up smaller than 1, which means the ROI pooling step cannot be performed.
Do you have any suggestions?

@ck196 I think you'll need to ensure that the original bounding boxes are reasonably large for this to work well. If the size is <1 after scaling, that would mean a ROI with an edge shorter than 16 pixels in the 224x224 input image. I think in general the RCNN algorithms will work best with large ROIs - for example, with the reference architecture (with a ROI pooling layer output size of 7x7, and a downsampling factor of 16x), any ROI smaller than (7x16)x(7x16) = 112x112 in the 224x224 input image will leave less than 7x7 pixels of available information inside the ROI in the final convolutional layer, thus resulting in duplicate information being passed into the final layers. I'm sure this is okay to some extent, but in the extremes it will probably degrade performance.

I haven't looked into the details of this in a while, though, and am definitely not an expert - so take my understanding of this with a grain of salt.
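As an aside, one common workaround for the degenerate (<1 cell) case (an assumption on my part, not something this repo necessarily does) is to clamp each scaled ROI to cover at least one feature-map cell before pooling:

```python
import numpy as np

def clamp_rois(rois_fm, min_size=1.0):
    """Hypothetical helper: rois_fm is an (N, 4) array of [x1, y1, x2, y2]
    boxes in feature-map coordinates."""
    rois_fm = np.asarray(rois_fm, dtype=np.float32)
    w = rois_fm[:, 2] - rois_fm[:, 0]
    h = rois_fm[:, 3] - rois_fm[:, 1]
    # grow any degenerate box symmetrically around its center
    pad_w = np.maximum(min_size - w, 0.0) / 2
    pad_h = np.maximum(min_size - h, 0.0) / 2
    return rois_fm + np.stack([-pad_w, -pad_h, pad_w, pad_h], axis=1)

print(clamp_rois([[5.2, 5.2, 5.6, 5.6]]))  # -> roughly [[4.9, 4.9, 5.9, 5.9]]
```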

Yes, it seems like Fast R-CNN has an issue with very small objects. I haven't found any thorough study on the topic, but, for example, this paper compares Fast R-CNN with a variant of the OverFeat architecture, and the results show a significant drop in performance for objects smaller than 32 pixels on the shorter side: http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zhu_Traffic-Sign_Detection_and_CVPR_2016_paper.pdf

What do you mean when you say 'duplicate information being passed into the final layers'?

@menglin0320 Imagine, for example, a 112x112 input bounding box. With four 2x2 pooling layers (resulting in a 2^4 = 16x downsampling factor), after downsampling the box has dimensions 7x7. Since the output layers expect a 7x7 input, this is the perfect size and no information is lost. Now, imagine a bounding box of half that size. After all the pooling layers, there will be less than 7x7 pixels. Since the output layers still expect a 7x7 input, we have to upscale - which by definition duplicates (or at least linearly combines) the information from the smaller number of pixels across a full 7x7 region.
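To see the duplication numerically, here is a small sketch (my own illustration, assuming Fast R-CNN-style floor/ceil bin boundaries) of which input cells each of the 7 output bins draws from:

```python
import numpy as np

def roi_pool_bins(in_size, out_size=7):
    """List the input cells each output bin would max-pool over,
    using floor/ceil bin edges in the style of Fast R-CNN."""
    bins = []
    for i in range(out_size):
        start = int(np.floor(i * in_size / out_size))
        end = int(np.ceil((i + 1) * in_size / out_size))
        bins.append(list(range(start, end)))
    return bins

# 7 cells -> 7 bins: [[0], [1], ..., [6]]; each cell is used exactly once
print(roi_pool_bins(7))
# 3 cells -> 7 bins: [[0], [0], [0, 1], [1], [1, 2], [2], [2]];
# each of the 3 cells feeds roughly three different output bins
print(roi_pool_bins(3))
```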

I ran into a problem when compiling the roi_pooling_op. Since I do not have a GPU installed, I configured bazel for CPU only. But when I tried to load the lib from the Python REPL, it says:

tensorflow.python.framework.errors.NotFoundError: dlopen(/Users/dtong/code/data/tensorflow/bazel-bin/tensorflow/core/user_ops/roi_pooling_op_grad.so, 6): Symbol not found: __Z27ZeroTensorGpuKernelLauncherPfi

Do you have any idea of a bazel configuration that would ignore the GPU code?

@tongda I'm moving this to a separate issue (#6), and will reply there.