fabiotosi92/NeRF-Supervised-Deep-Stereo

Some questions regarding training.

Closed this issue · 7 comments

Thank you for your outstanding work. I would like to know whether the loss functions mentioned in the paper are used with stereo matching networks like PSMNet. If I want to train on my own, do I simply replace the loss function in the backbone with the loss functions from this repository?

Thank you for your appreciation!

Regarding your question: yes, we employed a combination of the photometric loss and the disparity loss (computed on labels rendered by the neural radiance field) for networks like PSMNet and CFNet as well. The code that computes the total loss used for these two networks can be found in lines 56-60 at this link.
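Conceptually, the combination looks like the following minimal sketch. The function name, the AO threshold, and the weighting are illustrative assumptions, not the repo's exact values; see the linked code for the actual implementation.

```python
import torch


def nerf_supervised_loss(pred_disp, nerf_disp, ao, photo_loss,
                         ao_thresh=0.5, w_disp=1.0):
    """Sketch of the total loss: triplet photometric term plus an
    AO-filtered L1 disparity term (hypothetical names/weights).

    pred_disp: (B,1,H,W) disparity predicted by the stereo network
    nerf_disp: (B,1,H,W) noisy disparity labels rendered by NeRF
    ao:        (B,1,H,W) ambient-occlusion confidence in [0,1]
    photo_loss: scalar photometric loss computed from the triplet
    """
    mask = (ao > ao_thresh).float()  # keep only reliable pixels
    l1 = (torch.abs(pred_disp - nerf_disp) * mask).sum() / mask.sum().clamp(min=1.0)
    return photo_loss + w_disp * l1
```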

By the way, if you intend to train PSMNet using our dataset, I recommend utilizing our proposed training loss.

If you have any further questions or concerns, please don't hesitate to ask.

Thank you for your prompt response and for clearing up my doubts!

😨I still have some questions. So, during training, do I need to calculate: d1 = PSMNet(I_l, I_c) and d2 = PSMNet(I_c, I_r), and then use these results to compute the loss? In other words, will I be using PSMNet twice in the training process?

No, that's not the correct procedure. You feed the deep network a single stereo pair (center-right) chosen from a given triplet, and the network estimates one disparity map for that pair. During training, your goal is to minimize the discrepancy between this predicted disparity and the AO-filtered disparity obtained from NeRF (labels available jointly with our dataset). Additionally, you incorporate the triplet photometric loss by using all three images of the triplet (left-center-right).
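The procedure above can be sketched as a single training step: one forward pass on the center-right pair, then the two loss terms. This is a simplified illustration; `photometric_fn` stands in for the repo's trinocular photometric loss, and the AO threshold of 0.5 is an assumption, not the paper's exact value.

```python
import torch


def train_step(model, left, center, right, nerf_disp, ao, optimizer,
               photometric_fn):
    """One NeRF-supervised iteration (illustrative sketch).

    The network is invoked ONCE, on the center-right pair only;
    `photometric_fn(left, center, right, disp)` is a stand-in for the
    repo's trinocular photometric loss (hypothetical signature).
    """
    optimizer.zero_grad()
    # Single forward pass: one disparity map aligned with the center image.
    disp = model(center, right)
    # Supervised term against the AO-filtered NeRF labels.
    mask = (ao > 0.5).float()
    loss_disp = (torch.abs(disp - nerf_disp) * mask).sum() / mask.sum().clamp(min=1.0)
    # Photometric term uses all three views but the single disparity map.
    loss = photometric_fn(left, center, right, disp) + loss_disp
    loss.backward()
    optimizer.step()
    return loss.item()
```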

I suggest reading this issue for further insights related to the augmentation procedure.

Thank you for your prompt response, but I'm still a bit unclear. Computing the triplet photometric loss requires the reconstructed center image obtained through disparity backward-warping, which would seem to require the disparities between the left and center images as well as between the right and center images. Doesn't this imply invoking the backbone twice? And do we then use this loss function to train the entire stereo matching network? Also, are the ground-truth disparities and AO (Ambient Occlusion) obtained from NeRF?

I will address the raised doubts point by point:

  1. Disparity Map Alignment: You don't need to compute disparity maps aligned with both the left and right images. With a disparity map aligned with the central image alone, you can synthesize the reconstructed center image twice, from the actual left and right images of the triplet, through a process called backward warping. This concept is explained in our paper on page 4, under Section 3.3, NeRF-Supervised Training Regime. Additionally, in our repository you can find the code that computes the triplet loss at this link (trinocular_loss). This function takes the three images and the single disparity map aligned with the central image as inputs. Internally, the 'disp_warp' function appropriately warps the left and right images, which are then used to compute the triplet photometric loss.
  2. Total Loss Calculation: Similarly, in the 3.3. NeRF-Supervised Training Regime section of our paper, we explain that the total loss is a combination of two parts: a photometric loss derived from the triplet of images and a disparity loss. This total loss is then used for training the entire stereo network. The specific computation for the total loss can be found in our code snippet from line 55 to 75 in this link.
  3. Disparity Loss Labels: The labels used for the disparity loss are computed from NeRF and then refined with AO (Ambient Occlusion), which is available in our dataset. It's important to note that these labels are not exact ground truth; rather, they are noisy disparity labels. After careful filtering with AO, the remaining errors are small enough for the labels to be used for training.
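Point 1 above can be sketched as follows, assuming rectified images and a bilinear `grid_sample`-based backward warp. The sign convention (sample the left image at `x + d`, the right at `x - d`) and the per-pixel minimum over the two reconstructions are common occlusion-robust choices, not necessarily the repo's exact `trinocular_loss` (which may additionally use SSIM or different weighting).

```python
import torch
import torch.nn.functional as F


def disp_warp(src, disp, sign):
    """Backward-warp `src` into the center view using the single
    center-aligned disparity map. sign=-1 samples the right image,
    sign=+1 the left (assumed convention for a rectified triplet)."""
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing='ij')
    xs = xs.unsqueeze(0) + sign * disp.squeeze(1)  # shift along epipolar lines
    ys = ys.unsqueeze(0).expand(b, -1, -1)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack((xs / (w - 1) * 2 - 1, ys / (h - 1) * 2 - 1), dim=-1)
    return F.grid_sample(src, grid, align_corners=True)


def trinocular_photometric(left, center, right, disp):
    """Reconstruct the center image twice (from left and from right) and
    take the per-pixel minimum L1 error, so each pixel is scored against
    the view in which it is visible."""
    from_left = disp_warp(left, disp, sign=+1)
    from_right = disp_warp(right, disp, sign=-1)
    err = torch.minimum(torch.abs(from_left - center).mean(1),
                        torch.abs(from_right - center).mean(1))
    return err.mean()
```

Note that the backbone is still called only once: both reconstructions reuse the same center-aligned disparity map.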

Thank you so much, I think I've understood these issues now.😃