I don't get the logic of the intermediate output

Question

Closed this issue 8 days ago · 1 comments

Hi, I'm trying to use the model to predict from 2 context images.
But it always predicts 3 different Gaussian worlds, i.e., a means tensor of shape 3xNx3.
I see that in the training step you call it "intermediate output" or "supervise_intermediate_depth".

What's the purpose of that? Why do you get 3 different depth maps for each context image?
Thank you

Answer 1 · 2025-01-05T16:50:12.000Z

Hi, I guess you are running the base model, which predicts 3 depth maps at different resolutions. Since we predict residual depths at higher scales, we supervise all the intermediate depth predictions so that the depths at earlier stages do not become arbitrary. This is a common strategy in coarse-to-fine architectures, and it empirically improves performance.
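To make the idea concrete, here is a minimal sketch in plain Python of coarse-to-fine residual depth prediction with a loss on every stage. It is not the repository's code: the function names, the 1D "depth rows", the nearest-neighbour upsampling, and the L1 loss are all assumptions chosen for illustration. It shows two sets of predictions that give the same final depth, where only the per-stage loss penalizes the one with an arbitrary coarse stage.

```python
# Hypothetical sketch: names, shapes, and loss are illustrative assumptions,
# not the repository's actual API.

def upsample2x(depth):
    """Nearest-neighbour upsampling of a 1D depth row by a factor of 2."""
    return [d for d in depth for _ in range(2)]

def downsample(depth, factor):
    """Average-pool a 1D depth row by an integer factor."""
    return [sum(depth[i:i + factor]) / factor
            for i in range(0, len(depth), factor)]

def coarse_to_fine(coarse_depth, residuals):
    """Return the depth prediction at every stage, coarsest first.

    Each later stage is the upsampled previous stage plus a residual.
    """
    stages = [list(coarse_depth)]
    for res in residuals:
        up = upsample2x(stages[-1])
        stages.append([u + r for u, r in zip(up, res)])
    return stages

def l1(pred, target):
    """Mean absolute error between two equally sized rows."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(stages, gt_depth):
    """Supervise every stage against ground truth pooled to that stage's size."""
    loss = 0.0
    for pred in stages:
        factor = len(gt_depth) // len(pred)
        loss += l1(pred, downsample(gt_depth, factor))
    return loss

gt = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]

# Case A: the coarse stage is already a sensible low-resolution depth.
stages = coarse_to_fine(
    coarse_depth=[1.5, 3.5],
    residuals=[[-0.5, 0.5, -0.5, 0.5], [0.0] * 8],
)

# Case B: the coarse stage is arbitrary and the residuals compensate for it.
arbitrary = coarse_to_fine(
    coarse_depth=[0.0, 0.0],
    residuals=[[1.0, 2.0, 3.0, 4.0], [0.0] * 8],
)

final_depth = stages[-1]  # finest stage in this sketch
```

Both cases end with the same final depth, so a loss on the last stage alone could not tell them apart. With the per-stage loss, case B is penalized for its coarse stage. In this sketch the last entry of `stages` is the finest prediction; whether the 3xNx3 output of the actual model follows the same coarsest-first ordering is an assumption worth checking against the repository.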