anson0910/CNN_face_detection

A few questions, sorry to trouble you


Sorry to trouble you again, but I have some questions about the code and the paper.

1. In the code:

```python
current_rectangle = [int(2 * current_x * current_scale), int(2 * current_y * current_scale),
                     int(2 * current_x * current_scale + net_kind * current_scale),
                     int(2 * current_y * current_scale + net_kind * current_scale),
                     confidence, current_scale]
```

What is the meaning of the 2 in `2 * current_x * current_scale`?

2. In the paper it says an image pyramid is built to cover faces at different scales, but in your code the image is only shrunk, never enlarged. Is that right?

3. In the paper it says that densely scanning an image of size 800 × 600 for 40 × 40 faces with 4-pixel spacing generates 2,494 detection windows, and that the time reduces to 10 ms on a GPU card, most of which is overhead in data preparation. Can you tell me how the 2,494 is calculated?

4. For the 12-net it says 12 × 12 detection windows. Is it because the net input is 12 × 12 that the window is 12?

5. Does the 4-pixel spacing correspond to train_val.prototxt, and how is the 4 calculated?

Thank you for helping me so much. Also, are you Chinese?

Hi,

1. The 2 corresponds to the pooling layer of the first 12-net, since the pooling layer scales the input down by a factor of 2.

2. Yes. Shrinking the image is how larger faces are found; if you want to find smaller faces, you can simply decrease the min_face_size argument of the detect_faces_net function.
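For what it's worth, here is a minimal sketch of how such a shrink-only pyramid can be generated; the function name and the 0.7937 step factor are my own illustrative assumptions, not necessarily the repo's exact values:

```python
def pyramid_scales(img_w, img_h, min_face_size=40, net_input=12, step=0.7937):
    """Scales for a shrink-only image pyramid feeding a 12 x 12 net.

    The first scale (net_input / min_face_size) shrinks a face of
    min_face_size down to exactly net_input pixels; every further
    shrink lets correspondingly larger faces fit the 12 x 12 window.
    """
    scales = []
    scale = float(net_input) / min_face_size
    # stop once the whole image would become smaller than the net input
    while min(img_w, img_h) * scale >= net_input:
        scales.append(scale)
        scale *= step
    return scales

print(pyramid_scales(800, 600))  # starts at 12/40 = 0.3 and shrinks from there
```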

3. The number of detection windows can be calculated as follows. We wish to find 40 × 40 faces, so we first downscale the original image by a factor of 12/40, which gives an image of size 240 x 180 and ((240 - 40) / 4 + 1) * ((180 - 40) / 4 + 1) = 51 * 36 = 1836 windows at that pyramid level. The remaining pyramid levels add more windows, so the total number of detection windows varies with the resizing factor used to build the pyramid.
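In code, that count is just the following (the helper name is mine):

```python
def count_windows(width, height, win=40, stride=4):
    """Number of win x win sliding windows at the given stride."""
    return ((width - win) // stride + 1) * ((height - win) // stride + 1)

# 800 x 600 downscaled by 12/40 gives 240 x 180
print(count_windows(240, 180))  # 51 * 36 = 1836
```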

For the 12-net it says 12 × 12 detection windows; is it because the net input is 12 × 12 that the window is 12?
Yes.

Does the 4-pixel spacing correspond to train_val.prototxt, and how is the 4 calculated?
The spacing can be any value: according to the description in the original paper, crops are taken out of the image and fed into the 12-net one at a time.
However, I have modified the 12-net into a fully convolutional network, so that much of the redundant work is avoided.
You can take a look at this link if you're interested.
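To give an idea of that modification, here is a rough sketch following Caffe's standard net-surgery recipe; the layer names (fc2, fc2-conv) and file names are placeholders rather than the repo's actual ones:

```python
import caffe

# 12-net with an inner-product (fully connected) layer, plus a variant
# whose prototxt replaces that layer with an equivalent convolution.
net = caffe.Net('deploy.prototxt', '12net.caffemodel', caffe.TEST)
net_full_conv = caffe.Net('deploy_full_conv.prototxt',
                          '12net.caffemodel', caffe.TEST)

# Copy the fc parameters into the conv layer; only their shape changes.
for fc, conv in [('fc2', 'fc2-conv')]:  # placeholder layer names
    net_full_conv.params[conv][0].data.flat = net.params[fc][0].data.flat
    net_full_conv.params[conv][1].data[...] = net.params[fc][1].data

net_full_conv.save('12net_full_conv.caffemodel')
```

Run over a whole resized image, the fully convolutional net produces the probability for every 12 × 12 window in one forward pass, instead of cropping and classifying windows one at a time.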

You're welcome, and I'm from Taiwan!

Thank you! Why is the factor calculated as 12/40, and what is the meaning of the factor?
I fed a 466 x 699 image to the network; after resize_image it is 139 x 209, and with out = net_12c_full_conv.blobs['prob'].data[0][1, :, :], out.shape is 64 x 99. Does each point in the 64 x 99 map give the probability of a face, and if so, why can a single point represent a rectangle?
In the paper it says the image pyramid is resized by 12/F; does that mean w × 12/F and h × 12/F?
I used 10,000 face images and 10,000 background images without faces to train the 12-net. Is that enough?

The factor resizes the image so that, after resizing, each 12 x 12 block corresponds to a min_face_size x min_face_size block in the original image.

Why can a point represent a rectangle?
A point in the final output feature map represents a 12 x 12 block in the resized image, which in turn corresponds to a min_face_size x min_face_size block in the original image.
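Concretely, this is what the current_rectangle snippet at the top of the thread computes. Here is the same mapping as a function (pool_stride is my label for the factor of 2 contributed by the pooling layer, discussed above):

```python
def point_to_rectangle(current_x, current_y, current_scale, confidence,
                       net_kind=12, pool_stride=2):
    """Map one point of the 12-net's output map back to a box in the
    original image.  current_scale is original_size / resized_size for
    this pyramid level (min_face_size / 12 at the first level)."""
    return [int(pool_stride * current_x * current_scale),
            int(pool_stride * current_y * current_scale),
            int(pool_stride * current_x * current_scale + net_kind * current_scale),
            int(pool_stride * current_y * current_scale + net_kind * current_scale),
            confidence, current_scale]

# With min_face_size = 40 (current_scale = 40/12), the map point (10, 20)
# becomes the 40 x 40 box [66, 133, 106, 173].
print(point_to_rectangle(10, 20, 40.0 / 12, 0.99))
```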

In the paper it says the image pyramid is resized by 12/F; does that mean w × 12/F and h × 12/F?
Yes.

I used 10,000 face images and 10,000 background images without faces to train the 12-net. Is that enough?
Yes, I think that is enough.