
Question about the implementation of 2D cross-layer consistency

RuyiLian opened this issue · 8 comments


Thanks again for your great work. I have a question about the implementation of the 2D consistency loss:

loss = ((loss_x + loss_y + loss_z) / (z_mask_sum.sum())) / 572.3 # depends on K

I am confused about why the loss is divided by 572.3. In datasets/BOP_DATASETS/lm/camera.json, the camera information is:

  "cx": 325.2611,
  "cy": 242.04899,
  "depth_scale": 1.0,
  "fx": 572.4114,
  "fy": 573.57043,
  "height": 480,
  "width": 640

Also, will this affect the YCBV dataset, since it has different camera intrinsic parameters? Thanks!

Thanks for your question. The constant is the camera focal length (in pixels), and it is used to balance the weights of the loss terms. So you can adjust it to match your camera.

I don't remember whether I changed this parameter, though. I don't think it affects the final result much.
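A minimal sketch of what "adjust it accordingly" could look like: read the focal length from the dataset's camera.json instead of hard-coding it. The helper name `focal_normalizer` and the choice to average fx and fy are assumptions for illustration, not code from the repository; the hard-coded 572.3 is close to the fx value quoted above.

```python
import json

def focal_normalizer(camera):
    """Return one focal length (in pixels) from a BOP-style camera dict.

    Assumption: averaging fx and fy is good enough for a loss weight.
    """
    return 0.5 * (camera["fx"] + camera["fy"])

# Camera values for LM, copied from the question above.
camera_json = """{"cx": 325.2611, "cy": 242.04899, "depth_scale": 1.0,
 "fx": 572.4114, "fy": 573.57043, "height": 480, "width": 640}"""
camera = json.loads(camera_json)

f_norm = focal_normalizer(camera)
# The loss would then be divided by f_norm instead of the constant 572.3.
```

For another dataset, the same helper would be fed that dataset's camera.json, so the weight follows the intrinsics automatically.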

Thanks for your reply!

Sorry to bother you again. Could you give the intuition for using 1/f as the weight? I could not find the explanation in the paper (maybe I just missed it). Thanks!

I forgot the details.
But usually, I use the weights to balance each term so that their initial ranges are similar and they can all take effect during training.

I probably used 1/f because the 2D projections are measured on the image (in pixels), while the other losses are defined in 3D space. So I use 1/f to balance the terms.
According to the pinhole camera model, Zp = KP, so (ignoring the principal point) p = fP/Z, i.e. p/f = P/Z.
The other terms are defined in 3D space, e.g. |P - P_gt|,
while the 2D loss is defined on p, so the ratio between them is Z/f. That is roughly 1/f, given that on LMO the depth is typically 1-2 m.

I guess this is my initial motivation to use 1/f
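The scale argument above can be checked numerically. This is a sketch under stated assumptions: a plain pinhole projection with the LM intrinsics from the question, 3D points in meters, and a made-up pair of points at a depth of 1 m. The `project` helper is hypothetical and not taken from the repository.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (meters) to pixel coordinates."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)

# LM intrinsics from camera.json, as quoted in the question.
fx, fy, cx, cy = 572.4114, 573.57043, 325.2611, 242.04899

# Illustrative points (assumed values): a 1 cm error along X at Z = 1 m.
P_gt = (0.05, -0.02, 1.0)
P_pred = (0.06, -0.02, 1.0)

u_gt, v_gt = project(P_gt, fx, fy, cx, cy)
u_pred, v_pred = project(P_pred, fx, fy, cx, cy)

err_3d = abs(P_pred[0] - P_gt[0])   # error in meters
err_px = abs(u_pred - u_gt)         # same error seen in pixels
err_px_scaled = err_px / fx         # pixel error divided by f

print(err_3d, err_px, err_px_scaled)
```

At Z = 1 m the pixel error divided by f matches the 3D error, so the 2D term and the 3D terms start in a similar range. At other depths the two differ by a factor of Z.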

Thanks for your reply! This is really helpful.