Input resolution seems to have a big impact on performance
Hi, I find that the input resolution seems to have a very big impact on performance.
When I keep the input shape unchanged, e.g. 640 x 480, everything is fine: I can reproduce the metrics claimed in the paper on the NYU test set.
However, when I change the input resolution, e.g. resize every image to 384 x 288, the RMSE and the other metrics degrade notably: the RMSE goes from 0.33 to 0.95. Why does this happen? Requiring the same resolution at inference as in training seems unreasonable.
Thanks
There may be many reasons. One obvious one is that resizing the images changes the camera intrinsics, while the network still assumes the intrinsics it saw during training.
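To illustrate (a rough sketch, not code from this repo; the principal-point values below are made up for the example): resizing scales fx and cx by the horizontal factor and fy and cy by the vertical factor, so the effective intrinsics no longer match what the network was trained on.

```python
import numpy as np

def rescale_intrinsics(K, src_size, dst_size):
    """Scale a 3x3 pinhole intrinsics matrix when the image is resized."""
    sx = dst_size[0] / src_size[0]  # horizontal scale, e.g. 384/640 = 0.6
    sy = dst_size[1] / src_size[1]  # vertical scale,   e.g. 288/480 = 0.6
    K = K.copy()
    K[0, 0] *= sx  # fx
    K[0, 2] *= sx  # cx
    K[1, 1] *= sy  # fy
    K[1, 2] *= sy  # cy
    return K

# NYU-like intrinsics (focal 518.8579 as in the dataloader; the principal
# point (320, 240) is assumed here just for illustration)
K = np.array([[518.8579, 0.0, 320.0],
              [0.0, 518.8579, 240.0],
              [0.0, 0.0, 1.0]])
print(rescale_intrinsics(K, (640, 480), (384, 288)))
# fx becomes ~311.3, but the network implicitly still expects ~518.9
```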
@weihaosky Your explanation makes sense. That is to say, a model that predicts absolute depth can only be applied to images collected with cameras that have the same intrinsics as in training? This seems to be a big limitation. In your opinion, is there a good way to generalize the model to images captured with other cameras?
You can transform the input images so that their intrinsics match those of the cameras used in training, as in demo.py.
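Roughly, the idea is the following (a minimal sketch, not the exact demo.py code; the function name and defaults here are illustrative): rescale the image so its effective focal length equals the training focal, then crop to the training resolution.

```python
import cv2

def to_training_intrinsics(img, f_src, f_train=518.8579, train_size=(640, 480)):
    """Rescale an image so its effective focal length equals the training
    focal, then center-crop to the training resolution. Assumes the
    rescaled image is at least as large as the target crop."""
    scale = f_train / f_src
    h, w = img.shape[:2]
    img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    h, w = img.shape[:2]
    tw, th = train_size
    x0, y0 = (w - tw) // 2, (h - th) // 2
    return img[y0:y0 + th, x0:x0 + tw]
```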
@weihaosky @Tord-Zhang I have the same question. I tried reading demo.py, but I am confused about where exactly I should make changes in eval.py/train.py/test.py if I want to train and evaluate on my own dataset. Also, there is a variable called 'focal' in the dataloader, set to 518.8579. Could you please explain where exactly it is used? I might have missed it, but I could not find where it is being used. Any help on this would be greatly appreciated. Thanks a lot!
Kalyani
It seems like the focal value from the NYU intrinsic parameters was used in the KITTI dataloader. Is this correct? Why is this the case? Maybe I have overlooked something.
NYU camera intrinsic values shown in BTS:
https://github.com/cleinc/bts/blob/dd62221bc50ff3cbe4559a832c94776830247e2e/pytorch/bts_live_3d.py#L87
NYU camera intrinsics in KITTI dataloader:
This parameter is not used
Ok thank you for the clarification! How does the network predict metric depths without the focal parameter? Or is the focal information loaded and used somewhere else?
Thank you for the quick response!
The focal parameter is the same for all images in one training dataset, so there is no need to input it to the network.
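Related to the generalization question above: since all training images share one focal, a common heuristic (not something this repo does, as far as I know) is to rescale the predicted metric depth by the focal ratio when testing with a different camera. For a pinhole camera an object of size S at depth Z has apparent size s = f * S / Z, so the same appearance at a larger focal corresponds to proportionally larger depth:

```python
def adjust_depth_for_focal(depth_pred, f_test, f_train=518.8579):
    """Heuristic focal correction (illustrative, not from this repo):
    a network trained at focal f_train that infers depth from apparent
    size effectively learns Z = f_train * S / s; with a camera of focal
    f_test the same pixels correspond to Z = depth_pred * f_test / f_train."""
    return depth_pred * (f_test / f_train)
```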
As far as I can see, the focal parameters are slightly different across the KITTI dataset. For the test split they are all the same: 721.5377
https://github.com/aliyun/NeWCRFs/blob/d327bf7ca8fb43959734bb02ddc7b56cf283c8d9/data_splits/eigen_test_files_with_gt.txt
But the training images have slightly different focal parameters (quick check below):
https://github.com/aliyun/NeWCRFs/blob/d327bf7ca8fb43959734bb02ddc7b56cf283c8d9/data_splits/eigen_train_files_with_gt.txt
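Here is the quick check I ran (it assumes the last whitespace-separated field on each line of the split file is the focal length in pixels, which is how BTS-style split files are laid out):

```python
# List the distinct focal values in the training split.
focals = set()
with open("data_splits/eigen_train_files_with_gt.txt") as f:
    for line in f:
        fields = line.split()
        if fields:
            focals.add(float(fields[-1]))
print(sorted(focals))  # a few distinct values; KITTI calibrations vary slightly per drive
```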
The network can handle this small difference.
Yes, that makes sense.