fabiotosi92/NeRF-Supervised-Deep-Stereo

Metrics results

Nina-Konovalova opened this issue · 3 comments

Thank you very much for your work!

I'd like to ask a question about the evaluation on the 3nerf dataset. When I run the evaluation on 100 random photos with baseline 0.50, the obtained results seem relatively poor:

EPE: 2.5572
bad 1.0: 41.63%
bad 2.0: 19.63%
bad 3.0: 12.59%

Running on 100 random photos with baseline 0.10, however, gives much better results:
EPE: 0.3576
bad 1.0: 3.93%
bad 2.0: 1.70%
bad 3.0: 1.06%

Should I apply some disparity preprocessing steps before evaluation to obtain good results? And should any additional preprocessing steps be considered during training?
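For reference, EPE and the bad-τ rates can be computed along these lines (a minimal sketch, assuming NumPy disparity maps `pred` and `gt` plus a boolean validity mask `valid`; all names are illustrative):

```python
import numpy as np

def stereo_metrics(pred, gt, valid, taus=(1.0, 2.0, 3.0)):
    """End-point error and bad-tau percentages over the valid pixels."""
    err = np.abs(pred - gt)[valid]          # per-pixel absolute disparity error
    metrics = {"EPE": float(err.mean())}
    for tau in taus:
        metrics[f"bad {tau}"] = 100.0 * float((err > tau).mean())
    return metrics
```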

Hello, the issue is that you should not evaluate the network's predictions against the disparity maps obtained from NeRF, as they cannot be considered ground truth. The evaluation should be conducted on major stereo benchmarks such as KITTI and Middlebury.

Thank you very much for the answer!

But as I understand it, we train the stereo models only on the NeRF dataset and then test on other data. So why don't we get good results on the training data? We actually observe very different quality across baselines.

And should we apply any additional preprocessing steps to the NeRF disparity, or do we only need the augmentations from RAFT-Stereo?

I apologize for the delay in my response. These past few days I have been busy with the CVPR conference and couldn't respond promptly.

To address your question, it's important to clarify whether the evaluation was conducted on the disparity maps filtered with the uncertainty measure (AO) or on the dense disparity maps. If it was on the dense maps, I suggest evaluating only on the points considered more reliable after removing outliers, as in the sketch below. For further guidance on filtering unreliable points, I recommend reading the paper, which provides detailed insights.
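Something along these lines (a minimal sketch, not the paper's actual procedure: the arrays `pred`, `gt_nerf`, and `ao` and the threshold value are all hypothetical, and the convention that a higher AO score means a more reliable pixel is assumed):

```python
import numpy as np

AO_THRESHOLD = 0.5  # hypothetical cutoff; see the paper for the real criterion

def masked_epe(pred, gt_nerf, ao, threshold=AO_THRESHOLD):
    """EPE restricted to pixels whose AO score marks them as reliable."""
    reliable = ao > threshold            # assumed: higher AO = more reliable
    err = np.abs(pred - gt_nerf)[reliable]
    return float(err.mean())
```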

Additionally, it's worth considering that evaluating on disparity maps rendered with a larger baseline will inevitably lead to higher errors than evaluating on a smaller baseline, simply because the disparity values themselves are larger (for a fixed scene and focal length, disparity is proportional to the baseline: d = f·B/Z).
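To make this concrete, here is a toy calculation (all numbers hypothetical) showing that the same relative error yields a 5x larger absolute EPE at the 5x larger baseline:

```python
# Disparity grows linearly with the baseline: d = f * B / Z.
f, Z = 500.0, 5.0                  # hypothetical focal length (px) and depth
for B in (0.10, 0.50):
    d = f * B / Z                  # 10 px at B=0.10, 50 px at B=0.50
    rel_err = 0.02                 # same 2% relative error in both cases
    print(f"B={B}: d={d:.1f} px, abs err={rel_err * d:.2f} px")
```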

However, as mentioned before, I do not recommend relying solely on this approach to assess the quality of the trained networks. Instead, I recommend evaluating them on benchmarks that provide highly accurate ground-truth disparity maps.