naver/croco

Tiling-based Inference


Hi,

I wonder why the inference results look so different between images I took myself and images from the dataset.
Bike: [attached images: im0, im1, bike]
My images: [attached images: grill, grill0, grill1]

The bike result looks very smooth, but I used the same settings for inference on my own images.

Best

Hi and thanks for your interest in our work,

This is most likely due to our tiling-based inference scheme. Since we perform inference on tiles, typically 704px wide, while assuming crops at the same location in both images, one limitation of our model is large disparity values, which typically occur with a wide baseline and extremely high-resolution images, as seems to be the case in your example. Because of that, the matching area might not be visible within the tile.
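To illustrate, here is a minimal sketch of same-location tiling (not the repository's actual inference code: the `model(crop0, crop1) -> disparity` signature, the lack of tile overlap, and the simple paste-based merging are all assumptions for illustration):

```python
import torch

def naive_tiled_inference(model, im0, im1, tile_h=352, tile_w=704):
    """im0, im1: (3, H, W) tensors. Crops are taken at identical
    coordinates in both images, so a pixel whose match lies outside
    its own tile cannot be matched correctly."""
    _, H, W = im0.shape
    disp = torch.zeros(H, W)
    for y in range(0, H, tile_h):
        for x in range(0, W, tile_w):
            y1, x1 = min(y + tile_h, H), min(x + tile_w, W)
            crop0 = im0[:, y:y1, x:x1]
            # Same location in the second image: this is where large
            # disparities break, since the match may fall outside the tile.
            crop1 = im1[:, y:y1, x:x1]
            disp[y:y1, x:x1] = model(crop0[None], crop1[None])[0]
    return disp
```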

In such a scenario, one workaround is to downscale the images. Alternatively, one can perform a first inference (possibly at lower resolution) to sample tiles more smartly in a second pass. This is something we have tried and mentioned in appendix B.5 of the paper: for each crop, we first predict disparities considering the same crop coordinates in the second image. If high disparity values are present, we perform a second inference using shifted crop coordinates in the second image to ensure the matching area is visible, and we update these high disparity values.
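A hedged sketch of that two-pass idea (illustrative only: the `model` signature, the high-disparity threshold, and the median-based shift are my assumptions, not the exact procedure from appendix B.5):

```python
import torch

def two_pass_tile(model, im0, im1, y, x, tile_h=352, tile_w=704, thresh=0.5):
    crop0 = im0[:, y:y+tile_h, x:x+tile_w]
    # Pass 1: crop at the same location in the second image.
    crop1 = im1[:, y:y+tile_h, x:x+tile_w]
    disp = model(crop0[None], crop1[None])[0]
    high = disp > thresh * tile_w  # matches likely outside the tile
    if high.any():
        # Pass 2: shift the second-image crop left so the matching area
        # becomes visible, then add the shift back to the prediction.
        shift = int(disp[high].median())
        xs = max(x - shift, 0)
        crop1_shifted = im1[:, y:y+tile_h, xs:xs+tile_w]
        disp2 = model(crop0[None], crop1_shifted[None])[0] + (x - xs)
        disp[high] = disp2[high]
    return disp
```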

I hope that helps,
Best
Philippe

Thank you for your reply.

(1) However, I'm still confused about why the image of the bike (croco/Bicycle1-perfect/im0.png: PNG image data, 2988 x 2008, 8-bit/color RGB, non-interlaced) appears so smooth. As you mentioned, during the inference process, the image is tiled into segments 704 pixels wide, but I can barely notice any grid-like pattern. Is there any possibility that the model is overfitting?

(2) Does this mean that during training, you first crop the training pairs to (352, 704) and then feed them to the network?

(1) In the bicycle example, despite the high resolution, the maximum predicted disparity is ~145 (and the maximum ground-truth disparity is probably in the same range), so predictions based on tiles are fine. In your other example, the disparity values look really high, in which case the naive tiling strategy of sampling the 352x704 crop in the second image at the same location as in the first image fails completely, leading to really poor predictions. When merging these very poor per-tile predictions, the output is totally messed up. It is not about overfitting; it is a known failure mode of the tiling approach. For instance, here is the result on the Hoops example from the Middlebury test set, where the disparity range is high on the pillar of the stairs: it also fails dramatically, with tiling artifacts highly visible in those areas.
[image: prediction on the Middlebury Hoops example, showing visible tiling artifacts in high-disparity areas]
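For concreteness, here is a tiny check of when a match stays visible under same-location crops (assuming the usual rectified-stereo convention that left-image pixel x matches right-image pixel x - d; numbers below are illustrative):

```python
def match_visible_in_tile(x, d, tile_x0, tile_w=704):
    # The match at x - d must land inside the same tile [tile_x0, tile_x0 + tile_w).
    return tile_x0 <= x - d < tile_x0 + tile_w

print(match_visible_in_tile(x=500, d=145, tile_x0=0))  # True: Bicycle-like range
print(match_visible_in_tile(x=500, d=800, tile_x0=0))  # False: match falls outside the tile
```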

(2) Yes. During training, we take a random 352x704 crop in the first image and consider the crop at the same location and size in the second image. At test time, we perform a sliding window with the same 352x704 window size in the first image and always consider the crop at the same location and size in the second image.
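A minimal sketch of that training-time cropping, assuming (3, H, W) image tensors and an (H, W) ground-truth disparity map (function and variable names are illustrative, not the repository's API):

```python
import random

def sample_training_crops(im0, im1, disp_gt, crop_h=352, crop_w=704):
    """Draw one random 352x704 crop in the first image and take the crop
    at the same location and size in the second image and the ground truth."""
    _, H, W = im0.shape
    y = random.randint(0, H - crop_h)
    x = random.randint(0, W - crop_w)
    rows, cols = slice(y, y + crop_h), slice(x, x + crop_w)
    return im0[:, rows, cols], im1[:, rows, cols], disp_gt[rows, cols]
```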

Thank you so much for this patient explanation.