facebookresearch/InterHand2.6M

The scale of the depth

ZYX-MLer opened this issue · 1 comment

I've been studying the data-processing part of the code recently, and I noticed that in the __getitem__ function, when the bounding-box size of the input image does not match cfg.input_img_shape, the image is cropped and resized along the x and y dimensions, and the 2D joint coordinates are remapped accordingly by the code at the bottom of the augmentation function:

for i in range(joint_num):
    # warp each joint's x, y into the cropped/resized input-image space
    joint_coord[i, :2] = trans_point2d(joint_coord[i, :2], trans)
    # invalidate joints that land outside the input image
    joint_valid[i] = (joint_valid[i]
                      * (joint_coord[i, 0] >= 0) * (joint_coord[i, 0] < cfg.input_img_shape[1])
                      * (joint_coord[i, 1] >= 0) * (joint_coord[i, 1] < cfg.input_img_shape[0]))
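
For reference, trans_point2d just applies the 2x3 affine matrix (the same trans passed to cv2.warpAffine when the crop is resized) to a point in homogeneous coordinates; a minimal sketch consistent with the repo's utilities:

import numpy as np

def trans_point2d(pt_2d, trans):
    # trans is a 2x3 affine warp matrix; multiplying the homogeneous point
    # keeps the joint aligned with the warped (cropped + resized) image
    src_pt = np.array([pt_2d[0], pt_2d[1], 1.0])
    return (trans @ src_pt)[0:2]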

However, no corresponding processing is applied to the depth. So during actual training, the x and y values the model predicts are expressed in the resized training-image space, while the depth values are still in camera-space millimeters. Is this understanding correct? Should the depth also be scaled accordingly?
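
A toy illustration of the asymmetry, with made-up numbers (here trans is a pure 2x scale; the repo actually builds it from the bbox plus rotation/flip augmentation):

import numpy as np

trans = np.array([[2.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0]])   # e.g. a 128x128 crop resized to 256x256

joint = np.array([50.0, 60.0, 30.0])  # x (px), y (px), z (mm, camera space)
joint[:2] = trans @ np.array([joint[0], joint[1], 1.0])
print(joint)  # [100. 120.  30.] -- x and y are rescaled, z is untouched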

You can see this function, which is where the depth is handled, separately from the x/y crop:

def transform_input_to_output_space(joint_coord, joint_valid, rel_root_depth, root_valid, root_joint_idx, joint_type):
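    # (Sketch of the body, paraphrased from the released code; constant names
    # such as cfg.bbox_3d_size are taken from the repo's config -- verify them
    # against your checkout.)
    joint_coord, joint_valid = joint_coord.copy(), joint_valid.copy()

    # x, y: input-image pixels -> output heatmap cells
    joint_coord[:, 0] = joint_coord[:, 0] / cfg.input_img_shape[1] * cfg.output_hm_shape[2]
    joint_coord[:, 1] = joint_coord[:, 1] / cfg.input_img_shape[0] * cfg.output_hm_shape[1]

    # z: make each hand's depth relative to its own root (wrist) joint, then
    # normalize millimeters into [0, output_hm_shape[0]) discretized depth bins
    joint_coord[joint_type['right'], 2] -= joint_coord[root_joint_idx['right'], 2]
    joint_coord[joint_type['left'], 2] -= joint_coord[root_joint_idx['left'], 2]
    joint_coord[:, 2] = (joint_coord[:, 2] / (cfg.bbox_3d_size / 2) + 1) / 2.0 * cfg.output_hm_shape[0]
    joint_valid = joint_valid * (joint_coord[:, 2] >= 0) * (joint_coord[:, 2] < cfg.output_hm_shape[0])

    # the relative root depth is normalized the same way with its own constants
    rel_root_depth = (rel_root_depth / (cfg.bbox_3d_size_root / 2) + 1) / 2.0 * cfg.output_root_hm_shape
    root_valid = root_valid * (rel_root_depth >= 0) * (rel_root_depth < cfg.output_root_hm_shape)

    return joint_coord, joint_valid, rel_root_depth, root_valid

In other words, the depth target is root-relative and normalized by a fixed 3D box size, so it is deliberately independent of the 2D crop/resize scale.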