Clarification about bbox parameters
Hi authors,
Thanks for the great work! I noticed that the bbox center coordinates are divided by 8. Where does this number come from?
Hand4Whole_RELEASE/common/nets/module.py, lines 134 to 136 (commit afcbdf4)
And why is / 8 not applied to lhand_size, rhand_size, and face_size too?
Hi, the division by 8 undoes the 8x upsampling of the feature map performed by the deconv layers at this line:
Hand4Whole_RELEASE/common/nets/module.py, line 119 (commit afcbdf4)
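A minimal NumPy sketch of the idea, assuming a 2D soft-argmax over a 64x48 upsampled heatmap and an upsampling factor of 8. The function soft_argmax_2d, the heatmap, and the peak location are illustrative, not the repository's code. The soft-argmax returns the center in upsampled (64x48) pixel units, and dividing by 8 maps it back onto the 8x6 feature-map grid.

```python
import numpy as np

UPSAMPLE = 8  # assumed: 8x6 -> 64x48 via the deconv layers


def soft_argmax_2d(heatmap):
    """Expected (x, y) location under a softmax over an (H, W) heatmap."""
    h, w = heatmap.shape
    flat = heatmap.reshape(-1)
    prob = np.exp(flat - flat.max())  # numerically stable softmax
    prob = (prob / prob.sum()).reshape(h, w)
    x = (prob.sum(axis=0) * np.arange(w)).sum()  # expectation over columns
    y = (prob.sum(axis=1) * np.arange(h)).sum()  # expectation over rows
    return x, y


# Hypothetical 64x48 heatmap (upsampled space) with a sharp peak at x=40, y=16.
hm = np.full((64, 48), -1e4)
hm[16, 40] = 0.0

cx_up, cy_up = soft_argmax_2d(hm)            # coordinates in upsampled space
cx, cy = cx_up / UPSAMPLE, cy_up / UPSAMPLE  # back to the 8x6 grid
print(cx, cy)  # -> 5.0 2.0
```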
@mks0601 Thanks for the clarification. Why is it not applied to lhand_size, rhand_size, and face_size too, since they are also derived from img_feat?
Before the division by 8, the _center values are defined in the upsampled feature map space because they are obtained by the soft-argmax. The _size values are defined in the original feature map space (before upsampling) because the GTs are defined in the original space.
@mks0601 Thanks for the prompt reply! I'm not sure my understanding is correct, but isn't _size in the upsampled feature map space too?
Hand4Whole_RELEASE/common/nets/module.py, lines 119 to 136 (commit afcbdf4)
The deconv layers map img_feat from 2248x8x6 to 256x64x48.
_center is derived from img_feat, which is the upsampled feature map, and therefore needs to be scaled down.
_size is also derived from img_feat, the upsampled feature map, so I'm confused about why it is not scaled down.
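The 8x factor in the 8x6 to 64x48 change can be checked with the transposed-convolution output-size formula. This sketch assumes three deconv layers, each with kernel 4, stride 2, padding 1, and no output padding (a common configuration, assumed here rather than read from module.py). Under those settings each layer doubles the spatial size. The channel change (2248 to 256) is set by the layer configuration and is not modeled here.

```python
def deconv_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    # Transposed-convolution output size with dilation 1, matching the formula
    # documented for torch.nn.ConvTranspose2d. The default kernel/stride/padding
    # values are assumptions for illustration.
    return (size - 1) * stride - 2 * padding + kernel + output_padding


h, w = 8, 6          # spatial size of img_feat before the deconv layers
for _ in range(3):   # assumed: three stride-2 deconv layers, 2**3 = 8
    h, w = deconv_out(h), deconv_out(w)

print(h, w)  # -> 64 48
```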
_center depends on the shape of the feature map because it is obtained by the soft-argmax function. Take the plain argmax function as an example: the coordinates it returns depend on the shape of the feature map.
_size is independent of the shape of the feature map because it is directly regressed. The network can regress any number, and since _size is supervised with GTs defined in the original (pre-upsampling) space, it does not need to be divided by 8.
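To illustrate the argmax argument: a peak placed at corresponding locations on an 8x6 map and on a 64x48 map yields coordinates that differ by the upsampling factor. A directly regressed quantity has no such built-in dependence on map shape. The maps and the helper argmax_2d below are hypothetical.

```python
import numpy as np


def argmax_2d(hm):
    """Return the (x, y) index of the maximum of an (H, W) map."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    return int(x), int(y)


coarse = np.zeros((8, 6))
coarse[2, 5] = 1.0             # peak in the 8x6 map at x=5, y=2
fine = np.zeros((64, 48))
fine[2 * 8, 5 * 8] = 1.0       # corresponding location in the 64x48 map

print(argmax_2d(coarse), argmax_2d(fine))  # -> (5, 2) (40, 16)
```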
@mks0601 My concern is that the size network would have to learn to predict a value in the downsampled space (since the supervision is in the original space) when given information from the upsampled feature space. However, I agree that the correct _size could still be learned. Thanks for the clarification!
Hi, I think there is still some confusion. The size network is not aware of whether its features come from upsampled or downsampled feature maps.
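A toy check of why the size head cannot tell the two spaces apart. For simplicity it assumes upsampling is nearest-neighbour repetition rather than a learned deconv. Reading the feature vector at a joint location on the coarse map, or at the corresponding location on the 8x upsampled map, gives the same vector: nothing in it records the map's resolution. The feature map, joint location, and sample helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
coarse = rng.normal(size=(3, 8, 6))  # hypothetical C x H x W feature map (8x6 space)
# Assumption for illustration: nearest-neighbour 8x upsampling, not a learned deconv.
fine = coarse.repeat(8, axis=1).repeat(8, axis=2)  # shape (3, 64, 48)


def sample(feat, x, y):
    """Return the C-dim feature vector stored at integer location (x, y)."""
    return feat[:, y, x]


x_c, y_c = 4, 3                  # hypothetical joint location in 8x6 space
x_f, y_f = 4 * 8 + 3, 3 * 8 + 3  # a point inside the same cell in 64x48 space

v_coarse = sample(coarse, x_c, y_c)
v_fine = sample(fine, x_f, y_f)
print(v_coarse.shape, np.array_equal(v_coarse, v_fine))  # -> (3,) True
```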
@mks0601 Just to clarify, is this because the sampled joint features (features sampled from the image features at the joint coordinates) are no longer in any absolute space?
Yes, and the output is not in any absolute space either, since the values are directly regressed.
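A toy linear-regression analogy (not the actual Hand4Whole size head) for why a directly regressed output needs no division by 8: when the targets are given in the original space, least squares learns weights that produce outputs in that space. If the same features are presented at 8x larger magnitude, the fitted weights shrink by 8 and the predictions are unchanged. The features and target sizes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4))         # synthetic sampled-joint features
true_w = np.array([0.5, -1.0, 2.0, 0.3])
gt_size = feats @ true_w                  # synthetic GT sizes in the original space


def fit(x, y):
    """Ordinary least squares: weights minimising ||x @ w - y||^2."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w


w_a = fit(feats, gt_size)        # features at their original magnitude
w_b = fit(feats * 8.0, gt_size)  # same features scaled up by 8

print(np.allclose(feats @ w_a, (feats * 8.0) @ w_b))  # predictions match -> True
print(np.allclose(w_b, w_a / 8.0))                    # weights absorb the 8x -> True
```

The supervision, not the feature scale, fixes the units of the regressed output.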