Clarification about bbox parameters

Hi authors,

Thanks for the great work! I noticed that the bbox center coordinates are divided by 8. Where does this number come from?

Hand4Whole_RELEASE/common/nets/module.py

Lines 134 to 136 in afcbdf4

    
    lhand_center = lhand_center / 8
    rhand_center = rhand_center / 8
    face_center = face_center / 8

And why is / 8 not applied to the lhand_size, rhand_size, and face_size too?

Hi, the division undoes the 8x upsampling of the feature map, which comes from the deconv layer on this line:

Hand4Whole_RELEASE/common/nets/module.py

Line 119 in afcbdf4

img_feat = self.deconv(img_feat)

@mks0601 Thanks for the clarification. Why is it not applied to lhand_size, rhand_size, and face_size too, since they are also derived from img_feat?

Before the division by 8, the _center values are defined in the upsampled feature map space because they are obtained by soft-argmax.
The _size values are defined in the original feature map space (before upsampling) because the GTs are defined in the original space.

@mks0601 Thanks for the prompt reply! I'm not too sure if my understanding is correct, but _size seems to be in the upsampled feature map space too?

Hand4Whole_RELEASE/common/nets/module.py

Lines 119 to 136 in afcbdf4

    
    img_feat = self.deconv(img_feat)

    # bbox center
    bbox_center_hm = self.bbox_center(img_feat)
    bbox_center = soft_argmax_2d(bbox_center_hm)
    lhand_center, rhand_center, face_center = bbox_center[:,0,:], bbox_center[:,1,:], bbox_center[:,2,:]

    # bbox size
    lhand_feat = sample_joint_features(img_feat, lhand_center[:,None,:].detach())[:,0,:]
    lhand_size = self.lhand_size(lhand_feat)
    rhand_feat = sample_joint_features(img_feat, rhand_center[:,None,:].detach())[:,0,:]
    rhand_size = self.rhand_size(rhand_feat)
    face_feat = sample_joint_features(img_feat, face_center[:,None,:].detach())[:,0,:]
    face_size = self.face_size(face_feat)

    lhand_center = lhand_center / 8
    rhand_center = rhand_center / 8
    face_center = face_center / 8

After the deconv layers, img_feat goes from 2048x8x6 to 256x64x48.
_center is derived from img_feat, the upsampled feature map, and therefore needs to be scaled down.
_size is also derived from img_feat, the upsampled feature map, so I'm confused about why it is not scaled down too.

_center depends on the shape of the feature map because it is obtained by the soft-argmax function. Take plain argmax as an example: the coordinates it returns depend on the shape of the feature map.
_size is independent of the shape of the feature map because it is directly regressed. Any number could be regressed, and since it is supervised with GT defined in the original (before upsampling) space, _size does not need to be divided by 8.
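To make the first point concrete, here is a minimal NumPy sketch, not the repository's code. `soft_argmax_2d_np` and `peaked_heatmap` are hypothetical helpers written for this illustration, and the sketch assumes the simplest convention in which a location in the 8x6 grid maps to exactly 8 times that location in the 64x48 grid. It places a peak at the same relative position in both grids and shows that the soft-argmax coordinates differ by the factor of 8 that the division removes.

```python
import numpy as np

def soft_argmax_2d_np(heatmap):
    """Expected (x, y) pixel location under a softmax over a single 2D heatmap.

    Simplified stand-in for illustration; the repository's soft_argmax_2d
    works on batched tensors and may differ in details.
    """
    h, w = heatmap.shape
    prob = np.exp(heatmap - heatmap.max())  # subtract max for numerical stability
    prob /= prob.sum()
    ys, xs = np.mgrid[0:h, 0:w]             # per-pixel coordinate grids
    return float((prob * xs).sum()), float((prob * ys).sum())

def peaked_heatmap(h, w, cx, cy, sigma):
    """Log of an unnormalised Gaussian centred at (cx, cy), in pixel units."""
    ys, xs = np.mgrid[0:h, 0:w]
    return -((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2)

# Same relative location in the original (8x6) and upsampled (64x48) grids.
low = peaked_heatmap(8, 6, cx=3.0, cy=4.0, sigma=0.3)
high = peaked_heatmap(64, 48, cx=24.0, cy=32.0, sigma=2.4)

x_low, y_low = soft_argmax_2d_np(low)
x_high, y_high = soft_argmax_2d_np(high)

print("original space :", x_low, y_low)
print("upsampled space:", x_high, y_high)
print("upsampled / 8  :", x_high / 8, y_high / 8)
```

The coordinates come out in units of pixels of whichever grid the heatmap lives on, which is why only the values produced by soft-argmax need rescaling.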

@mks0601 My concern is that the size network would have to learn to predict a value in the downsampled space (since the supervision is in the original space) when given information in the upsampled feature space. However, I do agree that the correct _size could still be learned. Thanks for the clarification!

Hi, I think there is still some confusion. The size network is not aware of whether the features come from an upsampled or a downsampled feature map.

@mks0601 Just to clarify, this is because the sampled joint features (joint features sampled from image features using joint coordinates) are no longer in any absolute space?

Yes, and the output is also not in any absolute space, as the values are directly regressed.
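A rough sketch of why the size head cannot tell which resolution its input came from, under loud assumptions: `sample_feature` uses a nearest-neighbour lookup as a stand-in for whatever interpolation `sample_joint_features` performs, and `size_head` is a hypothetical single linear layer, not the repository's actual `lhand_size` module. The point is only that the sampled vector has one entry per channel regardless of the spatial size of the map, and the regressed output is just two unconstrained numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 256  # channel count after the deconv layers, as quoted in this thread

def sample_feature(feat, x, y):
    """Nearest-neighbour lookup of one C-dim feature vector at pixel (x, y).

    feat has shape (C, H, W). Hypothetical simplification for illustration.
    """
    _, h, w = feat.shape
    xi = int(np.clip(round(x), 0, w - 1))
    yi = int(np.clip(round(y), 0, h - 1))
    return feat[:, yi, xi]

# Hypothetical size head: one linear layer mapping C features to (width, height).
W = rng.normal(size=(2, C)) * 0.01
b = np.zeros(2)

def size_head(vec):
    return W @ vec + b

feat_high = rng.normal(size=(C, 64, 48))  # upsampled feature map
feat_low = rng.normal(size=(C, 8, 6))     # a map at the original resolution

vec_high = sample_feature(feat_high, 24.0, 32.0)
vec_low = sample_feature(feat_low, 3.0, 4.0)

size_high = size_head(vec_high)
size_low = size_head(vec_low)

# Both inputs are plain C-dim vectors and both outputs are plain 2-vectors.
print(vec_high.shape, vec_low.shape, size_high.shape, size_low.shape)
```

Nothing in the input or output shapes carries the spatial resolution, so the scale of the predicted size is set entirely by the GT used for supervision.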

@mks0601 Got it, thanks for clearing that up!
