About the Equation 5 for Full Surround Monodepth from Multiple Cameras
haoweiz23 opened this issue · 20 comments
Hi, thank you for your work. I am trying to reproduce your pose consistency loss. This loss constrains the pose predicted from each other camera to be consistent with that of the front camera after transformation. However, it is hard to understand how Equ. 5 transforms a pose from one coordinate frame to another. Could you please provide more explanation or detailed code? Thanks.
This is the code I implemented; I cannot promise it is right.
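```python
# Note: the method name and the final `return` are assumed; the original
# snippet omitted the signature. `Pose` is the 4x4 pose wrapper class.
def pose_consistency_loss(self, poses, extrinsics):
    """
    Calculate the pose consistency loss.

    :param poses: list of poses, each [B, 4, 4]
        these predictions are transformed to the coordinate frame of the canonical camera.
    :param extrinsics: torch.tensor [B, 4, 4]
        extrinsics for all cameras, used to transform the poses.
    :return: weighted sum of the rotation and translation losses
    """
    rot_loss = 0
    trans_loss = 0
    extrinsics = extrinsics.to(poses[0].item().dtype)
    # Repeat the canonical (index 0) extrinsic once per camera: [B, 4, 4]
    canonical_extrinsic = extrinsics[0].repeat([extrinsics.shape[0], 1, 1])
    canonical_extrinsic = Pose(canonical_extrinsic)
    extrinsics = Pose(extrinsics)
    # extrinsic = extrinsics[1:, ...]
    for pose in poses:
        # Transform from each camera frame to the canonical camera frame
        X_i2j = canonical_extrinsic.inverse() @ extrinsics
        # Express every camera's predicted pose in the canonical frame
        X_ba = X_i2j @ pose @ X_i2j.inverse()
        # Compare against the canonical camera's own prediction (index 0)
        rot_loss += torch.sum((X_ba.mat2vec()[:, :3] - pose.mat2vec()[0, :3]).pow(2))
        trans_loss += torch.sum((X_ba.mat2vec()[:, 3:] - pose.mat2vec()[0, 3:]).pow(2))
    loss = self.rotation_weight * rot_loss + self.translation_weight * trans_loss
    return loss
```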
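For anyone puzzling over how a pose moves between camera frames, the conjugation `X_i2j @ pose @ X_i2j.inverse()` used above can be checked numerically. The following is only a minimal NumPy sketch, not the FSM authors' code: `make_pose` and `to_canonical` are hypothetical helper names, the rig values are made up, and the extrinsics are assumed to be camera-to-vehicle 4x4 matrices on a rigid rig.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a 3-vector."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def to_canonical(pose_i, extrinsic_canonical, extrinsic_i):
    """Express camera i's ego-motion in the canonical camera's frame."""
    # Maps coordinates in camera i to coordinates in the canonical camera
    x_i2c = np.linalg.inv(extrinsic_canonical) @ extrinsic_i
    # Conjugation: go to camera i, apply its motion, come back
    return x_i2c @ pose_i @ np.linalg.inv(x_i2c)

# Hypothetical two-camera rig (values are illustrative only)
extrinsic_front = make_pose(0.0, [1.5, 0.0, 1.6])
extrinsic_side = make_pose(np.pi / 2, [1.0, 0.5, 1.6])

# Ego-motion as seen by the front camera
pose_front = make_pose(0.02, [1.0, 0.0, 0.0])

# The same rigid motion as the side camera would see it
x = np.linalg.inv(extrinsic_front) @ extrinsic_side
pose_side = np.linalg.inv(x) @ pose_front @ x

# Transforming the side-camera pose back recovers the front-camera pose
recovered = to_canonical(pose_side, extrinsic_front, extrinsic_side)
```

In this setup `pose_side` differs from `pose_front`, while `recovered` matches it up to floating-point error, which is the property the consistency loss penalizes deviations from.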
@hjxwhy Thanks a lot. I believe this is right. By the way, have you implemented and evaluated the spatio-temporal loss in FSM? I cannot achieve the same improvement as Table 3 in the FSM paper (my results even get worse). Maybe there is a problem in my implementation.
I implemented the spatial-wise pe loss as below, following Equation 3 in the FSM paper.
```python
def spatial_wise_pe_loss(self, batch, output, return_logs=False, progress=0.0):
    # Calculate spatial contexts
    spatial_contexts_indices = np.array([[1, 2], [0, 3], [0, 4], [1, 5], [2, 5], [3, 4]])
    spatial_contexts_rgb = [batch['rgb_original'][spatial_contexts_indices[:, 0]],
                            batch['rgb_original'][spatial_contexts_indices[:, 1]]]
    poses = torch.Tensor(batch['extrinsics']) if isinstance(batch['extrinsics'], list) else batch['extrinsics']
    intrinsics = torch.Tensor(batch['intrinsics']) if isinstance(batch['intrinsics'], list) else batch['intrinsics']
    spatial_context_intrinsics = [intrinsics[spatial_contexts_indices[:, 0]],
                                  intrinsics[spatial_contexts_indices[:, 1]]]
    spatial_context_masks = [batch['mask'][spatial_contexts_indices[:, 0]],
                             batch['mask'][spatial_contexts_indices[:, 1]]]
    source_poses = Pose(poses)
    reference_poses = [Pose(poses[spatial_contexts_indices[:, 0]]),
                       Pose(poses[spatial_contexts_indices[:, 1]])]
    relative_poses = [Pose(torch.bmm(reference_poses[0].inverse().item(), source_poses.item())),
                      Pose(torch.bmm(reference_poses[1].inverse().item(), source_poses.item()))]
    spatial_output = self.self_supervised_loss(
        batch['rgb_original'], spatial_contexts_rgb,
        output['inv_depths'], relative_poses, intrinsics, spatial_context_intrinsics,
        return_logs=return_logs, progress=progress, mask=batch['mask'], ref_mask=spatial_context_masks)
    return spatial_output
```
The implementation looks alright to me. Some things that have helped other people achieve similar results:
- Starting from a pre-trained model without the spatio-temporal constraints
- Defining a larger value for the minimum depth of the network, so there is overlap between cameras to begin with (otherwise the temporal network can produce a scale that doesn't have any spatial overlap, and it doesn't leverage those constraints)
- Focal length scaling for the output depth maps (the front camera of DDAD has different intrinsics than the other cameras)
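To make the minimum-depth suggestion concrete, here is a minimal sketch of how a bounded sigmoid output is commonly mapped to depth, in the style of the monodepth2 sigmoid-to-depth conversion. It is an assumption that the network in question uses this kind of mapping, and the numeric bounds below are illustrative, not recommended values.

```python
import numpy as np

def disp_to_depth(disp, min_depth, max_depth):
    """Map a sigmoid output in [0, 1] to a depth in [min_depth, max_depth]."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp

# Sigmoid outputs at the two extremes and in the middle
sigmoid_out = np.array([0.0, 0.5, 1.0])

# Illustrative bounds only: raising min_depth raises the smallest depth
# the network is able to predict
depth_small_min = disp_to_depth(sigmoid_out, min_depth=0.1, max_depth=100.0)
depth_large_min = disp_to_depth(sigmoid_out, min_depth=1.0, max_depth=100.0)
```

With a larger `min_depth`, the predicted depths cannot collapse to very small values, which is one way to read the point above about keeping the predicted scale in a range where neighboring cameras overlap.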
Hi,
What do you mean by focal length scaling? Would you mind providing more details?
Instead of training the depth decoder to handle different intrinsics, is it about using a constant to rescale the depth values for the front view?
Thank you!
@VitorGuizilini-TRI Thanks a lot! Your suggestions are very helpful. I tried focal length scaling and it works. I am now trying to start from a model pretrained without the spatio-temporal constraints.
I don't quite understand your second suggestion, though. Why does a larger value for the minimum depth help? Is it because larger depths produce more overlapping areas when projecting between different cameras? If so, do you have a recommended minimum depth?
Thank you again for your timely suggestions.
@hurjunhwa Hi, I implement focal length scaling by scaling the output depth by a constant, i.e., the focal length, which comes from the input intrinsics. Because I do not have the physical camera parameters (e.g., dx and dy), I simply take f_x from the intrinsics as the focal length to scale the depth. I tried this trick on DDAD and it works. Hope this is helpful.
@LionRoarRoar My STC implementation is the same as yours, but my results also degrade. You tried scaling the depth by the focal length: does that mean each camera's output is multiplied by the focal length, or divided by it? By the way, in my tests, input images with self-occlusion give a larger RMSE than the front camera alone. Have you faced this problem?
I scale each camera's output with its corresponding focal length. In my experiments, all other cameras get worse results than the front camera on every metric. If only RMSE is larger than for the front camera, that seems unreasonable. Maybe you have the wrong normalization layer in the last output layer.
@LionRoarRoar Thanks for your reply. I ran an experiment that trains the front camera and CAMERA_8 separately. CAMERA_8 is worse than the front camera in all metrics, so I guess this is caused by the self-occlusion in the CAMERA_8 images. But I'm not sure, because the paper doesn't seem to have this problem. Do you plan to run this experiment? Sorry for asking again: does scaling the depth mean multiplying the inverse depth by the focal length?
@hjxwhy
A1: Maybe your hypothesis is right. I noticed that the self-occlusion region shifts slightly between frames, which means it is hard to pre-define an accurate self-occlusion mask. Images from the front camera are clean and without occlusion, so it should get better results than the other cameras.
A2: You should scale the depth map, not the inverse depth.
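As a rough illustration of the focal length scaling described in this thread (multiply each camera's depth map, not its inverse depth, by that camera's f_x), here is a minimal NumPy sketch. The function name `scale_depth_by_focal`, the array shapes, and the intrinsics values are assumptions for illustration; whether the result should additionally be normalized by a reference focal length is not stated in the thread and is left out.

```python
import numpy as np

def scale_depth_by_focal(depth, intrinsics):
    """Scale per-camera depth maps by each camera's focal length f_x.

    depth:      [N, H, W] raw depth maps, one per camera
    intrinsics: [N, 3, 3] pinhole intrinsics, with f_x at position [0, 0]
    """
    fx = intrinsics[:, 0, 0]
    # Broadcast f_x over the spatial dimensions of each depth map
    return depth * fx[:, None, None]

# Two hypothetical cameras with different focal lengths
intrinsics = np.stack([
    np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]),
    np.array([[0.5, 0.0, 1.0], [0.0, 0.5, 1.0], [0.0, 0.0, 1.0]]),
])
raw_depth = np.ones((2, 2, 3))

scaled_depth = scale_depth_by_focal(raw_depth, intrinsics)
```

The same raw network output thus maps to a different metric depth for each camera, so the decoder itself does not have to account for the differing intrinsics.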
@LionRoarRoar THANKS, I will try again. If I have some new results I will share with you here. Best wishes!
Updates:
1. I tried the spatial-wise constraint starting from a model pretrained without the spatio-temporal constraints. It is indeed better than without pretraining. However, it is still worse than the baseline model. Besides, I am afraid this trick means the spatial-wise constraint can no longer be compared fairly with the baseline?
2. I also tried the spatial-wise loss with a larger min_depth, starting from a model pretrained without the spatio-temporal constraints, and the performance drops.
Regarding `rot_loss += torch.sum((X_ba.mat2vec()[:, :3] - pose.mat2vec()[0, :3]).pow(2))`: I think the pose used for supervision here should be cam1_pose.
Have you reached the accuracy reported in the paper? I can't reproduce it.
@abing222 No. Only the self-occlusion mask works. STC and the pose consistency loss do not work.
In my experiments, with only the self-occlusion mask, Abs Rel did not decrease as much as in the paper.
At present, I can obtain the absolute scale through the spatial constraint, but the accuracy decreases slightly. After adding STC, the accuracy increases a little.
You mean the spatial-wise constraints do not work but STC does? That is interesting. Could you please provide more implementation details about your STC, such as the loss weights, how you warp the spatio-temporal images, and the self-occlusion mask?
The spatial-wise constraint is useful: it provides absolute scale, but the accuracy decreases. I modified the monodepth2 repo code rather than using the packnet repo.
I also cannot obtain the absolute scale with the spatial photometric loss. Do you use any pretrained model? Or did you change the min_depth parameter in the monodepth2 repo?
Hi, weiyi. I am also trying to implement this work. Maybe we can connect on WeChat to discuss. My WeChat: zhuhaow_