google/mannequinchallenge

Clarifications on training pipeline


Hi guys! Congrats on the great work.

I have been trying to implement the single-view training pipeline, following the details in the paper's supplementary material, and I have a few questions regarding the implementation:

  1. In the scale-invariant MSE loss (Equation 8), the second term is divided by N. Shouldn't that be N^2? Is this just a typo?

  2. During training you state that you normalize the GT log depths by subtracting a randomly chosen value between the 40th and 60th percentiles. So you don't actually "normalize" but rather center the map around 0. Since the losses consider relative distances between pixel pairs, how does this centering affect performance?
    Moreover, I have generated the MC dataset following the instructions in #14, and I have noticed that the absolute and log values of the GT depth maps (usually in the range [5-50]) are significantly larger than those of the depth maps generated by your single-view pretrained model (usually in the range [0.2-1.5]). Is there some other kind of normalization that you also perform?

  3. Finally, regarding the "paired" MSE loss, do you actually compute the distances between all possible pairs, or do you do some sub-sampling? This can become really computationally intensive even at small resolutions, since there are potentially n*(n-1)/2 possible pairs for an image with n pixels.

Thanks!

Hi,
(1) Yes. I have fixed it in https://www.cs.cornell.edu/~zl548/images/mannequin_depth_cvpr2019_supp_doc.pdf
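
Concretely, with that fix, Equation 8 becomes the usual Eigen-style scale-invariant MSE. Writing $d_i$ for the per-pixel log-depth difference (prediction minus ground truth) over the $N$ valid pixels:

$$
L_{\text{si-MSE}} = \frac{1}{N}\sum_{i=1}^{N} d_i^2 \;-\; \frac{1}{N^2}\Bigl(\sum_{i=1}^{N} d_i\Bigr)^2
$$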
(2) The centering does not change the log difference between any pair of depths. Since the SfM reconstruction can have an arbitrary scale, this trick just makes sure the depth range is reasonable. If you want to convert your predictions back to the original SfM model's scale, you have to use either SfM points or MVS points to scale them back.
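
A minimal NumPy sketch of that centering, purely for illustration (the 40-60th percentile range is from the paper; the helper name and everything else here is mine, not the actual training code):

```python
import numpy as np

def center_log_depth(depth, valid_mask, rng):
    """Center GT log depth around 0 by subtracting the value at a
    randomly chosen percentile (uniform in [40, 60]) of the valid pixels.

    A constant shift leaves every pairwise difference
    log(d_i) - log(d_j) unchanged, so pair-based losses are unaffected.
    """
    log_depth = np.log(depth)
    p = rng.uniform(40.0, 60.0)
    shift = np.percentile(log_depth[valid_mask], p)
    return log_depth - shift

# Example on a random positive depth map with all pixels valid.
rng = np.random.default_rng(0)
depth = rng.uniform(5.0, 50.0, size=(4, 4))
mask = np.ones_like(depth, dtype=bool)
print(center_log_depth(depth, mask, rng).mean())  # roughly 0
```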
(3) If you look at Section 2.4 in our supplementary material (https://www.cs.cornell.edu/~zl548/images/mannequin_depth_cvpr2019_supp_doc.pdf), it shows the algebraic trick that lets the all-pairs O(N^2) sum be computed in linear time (and in closed form), so you don't need to subsample pixels.
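
The identity behind that trick is most likely the standard expansion of the all-pairs sum of squared differences; here is a minimal PyTorch sketch (the function name is illustrative, not from this repo):

```python
import torch

def pairwise_sq_diff_sum(d):
    """Sum of (d_i - d_j)^2 over all unordered pixel pairs i < j.

    Uses the identity
        sum_{i<j} (d_i - d_j)^2 = N * sum_i(d_i^2) - (sum_i d_i)^2,
    so the O(N^2) pair sum is evaluated in O(N) time, in closed form.
    """
    n = d.numel()
    return n * (d ** 2).sum() - d.sum() ** 2

# Sanity check against the naive O(N^2) loop on a tiny input.
d = torch.randn(6, dtype=torch.float64)
naive = sum((d[i] - d[j]) ** 2 for i in range(6) for j in range(i + 1, 6))
assert torch.allclose(pairwise_sq_diff_sum(d), naive)
```

Dividing the ordered-pairs version of this sum by 2N^2 recovers exactly the scale-invariant MSE above, which is why no pixel subsampling is needed.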

Thanks for the quick reply