Understanding cascading of sizes in mtcnn
sidgan opened this issue · 3 comments
Hi,
I'm trying to follow the code and understand how MTCNN works. I understand that, for each image and each scale, detections come from each of the networks. In particular, I am looking at PNet right now.
The image is rescaled according to the scales computed earlier, and the rescaled image goes into PNet in the following line of the code:
https://github.com/DuinoDu/mtcnn/blob/master/demo.py#L268
For reference I have printed out the original size and the rescaled size:
ORIGINAL Height: 340
ORIGINAL Width: 151
SCALE USED (computed earlier): 0.107493555074
RESCALED Height: 37
RESCALED Width: 17
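For completeness, the rescaled size follows from the scale by a simple ceil-and-resize step. Here is a minimal sketch of that arithmetic (the variable names and the use of cv2.resize are illustrative, not copied from demo.py):

```python
import numpy as np
import cv2

h, w = 340, 151                       # original image size from the printout above
scale = 0.107493555074                # one scale from the image pyramid

hs = int(np.ceil(h * scale))          # -> 37, the printed RESCALED Height
ws = int(np.ceil(w * scale))          # -> 17, the printed RESCALED Width

img = np.zeros((h, w, 3), np.uint8)   # stand-in for the loaded image
im_data = cv2.resize(img, (ws, hs))   # this 17x37 image is what PNet receives
print(im_data.shape)                  # (37, 17, 3)
```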
The net here corresponds to PNet, and in det1.prototxt (PNet) the input size is h=12 and w=12:
```
# Code file: det1.prototxt
input_dim: 1
input_dim: 3
input_dim: 12
input_dim: 12
```
What I don't understand is how the input goes from the rescaled image size to 12x12. Where does that happen?
Hey, thanks for asking. Since PNet is a fully convolutional network, the input size is not limited, so 37x17 is fine as input to PNet. As for the 12x12 in det1.prototxt, it is just an initialization of the network and does not represent the actual input image size.
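In other words, the 12x12 in det1.prototxt only sets the initial shape of the input blob; before each forward pass, the blob can be reshaped to whatever size the rescaled image happens to be. A minimal pycaffe sketch of that idea (the blob name 'data', the HWC-to-CHW transpose, and the file names are assumptions, not verified against demo.py):

```python
import numpy as np
import caffe

net = caffe.Net('det1.prototxt', 'det1.caffemodel', caffe.TEST)

# Rescaled image for one pyramid level, e.g. 37 (H) x 17 (W).
im_data = np.random.rand(37, 17, 3).astype(np.float32)

# Reshape the input blob to this image's size; the 12x12 from the
# prototxt is simply overwritten here.
net.blobs['data'].reshape(1, 3, im_data.shape[0], im_data.shape[1])
net.blobs['data'].data[...] = im_data.transpose(2, 0, 1)  # HWC -> CHW

out = net.forward()
# Because PNet is fully convolutional, the score/regression maps in `out`
# grow with the input; each output position scores one 12x12 window
# of the rescaled image.
```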
Could you please explain how this works? How does the network adapt to each input size?
It's much like the RPN in Faster R-CNN. The network doesn't scale each input to a fixed size, since there is no fully connected layer in the network. The 12x12 is used in https://github.com/DuinoDu/mtcnn/blob/master/demo.py#L153
That means that for any input size, PNet first generates feature maps, and these feature maps can be of any size. Then a sliding window is used to generate fixed-size seed boxes of 12x12. This is done in https://github.com/DuinoDu/mtcnn/blob/master/demo.py#L151
in def generateBoundingBox(map, reg, scale, t). These seed boxes are then moved closer to the target using bounding box regression. In addition, the third branch, facial landmark localization, is used to improve the accuracy of face detection.
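To make that mapping concrete, here is a rough sketch of what the sliding-window step in generateBoundingBox(map, reg, scale, t) amounts to (the names, array layout, and rounding are illustrative; the actual code in demo.py may differ in details):

```python
import numpy as np

def generate_bounding_box(score_map, reg, scale, threshold):
    # Every output-map position whose face score exceeds the threshold
    # becomes a 12x12 "seed" box, mapped back to original-image coordinates.
    stride = 2      # effective output stride of PNet
    cellsize = 12   # fixed window size that each output position scores

    y, x = np.where(score_map >= threshold)                 # positions above threshold
    score = score_map[y, x]
    dx1, dy1, dx2, dy2 = [reg[y, x, i] for i in range(4)]   # regression offsets

    # Undo the pyramid scale so the boxes live in the ORIGINAL image;
    # each box covers a 12x12 patch of the rescaled image.
    x1 = np.round((stride * x + 1) / scale)
    y1 = np.round((stride * y + 1) / scale)
    x2 = np.round((stride * x + cellsize) / scale)
    y2 = np.round((stride * y + cellsize) / scale)

    return np.column_stack([x1, y1, x2, y2, score, dx1, dy1, dx2, dy2])
```

These seed boxes are what the bounding box regression described above then refines.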