nianticlabs/scoring-without-correspondences

Question about the epipolar cross-attention

Closed this issue · 2 comments

Hello! Thanks for open-sourcing this amazing work!

However, I am confused about the "Epipolar Cross-attention" module proposed in the paper. I wonder how it receives the "visual features" extracted by a common backbone, somehow conditions them on "epipolar-geometry fitness" information, and then applies an MLP to output a fitness score that ranks the F/E hypotheses. Could you kindly explain the intuition and mechanism behind the "Epipolar Cross-attention"?

Looking forward to your reply!

Hello! Thank you for your interest in our work!

Let me rephrase the intuition behind our Epipolar Cross-attention layer.

Given two images, A and B, we extract "visual features" with a common backbone. Other papers, e.g., LoFTR, would now apply a transformer-like architecture with several layers of self- and cross-attention. Self-attention only looks within the same image's features (A/A and B/B), while cross-attention uses information from the opposite view (A/B and B/A).
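To make the self/cross distinction concrete, here is a rough PyTorch sketch. This is not our actual layer, just plain scaled dot-product attention with illustrative names (`feats_a`, `feats_b`):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention: queries attend over keys/values.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Visual features from a shared backbone, one row per image position.
feats_a = torch.randn(1024, 256)
feats_b = torch.randn(1024, 256)

# Self-attention: each image only attends within itself (A/A, B/B).
self_a = attention(feats_a, feats_a, feats_a)

# Cross-attention: image A queries *all* positions of image B (A/B),
# with no notion of which positions are geometrically plausible.
cross_a = attention(feats_a, feats_b, feats_b)
```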

Since the cross-attention layer uses all the information from the opposite view, it has no mechanism to assess the quality of an epipolar geometry estimate. Therefore, we introduce the Epipolar Cross-attention layer, which limits the search space in the opposite view to only those positions that agree with the epipolar geometry being evaluated.

A candidate epipolar geometry gives a point-to-line constraint, and this constraint is what we use in the transformer: during cross-attention, we select a point in image A and attend only along its corresponding epipolar line in image B.
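Here is a minimal sketch of that idea, assuming a hard distance threshold around the line for clarity (the actual layer may instead sample or softly weight positions along the line; `epipolar_cross_attention`, `pts_a`, `pts_b`, and `thresh` are illustrative names, not the code in this repo):

```python
import torch

def epipolar_cross_attention(feats_a, feats_b, pts_a, pts_b, F_cand, thresh=2.0):
    """Cross-attention where each query in A only attends to positions
    in B that lie near its epipolar line under a candidate geometry.

    feats_a: (Na, C), feats_b: (Nb, C) backbone features
    pts_a:   (Na, 3), pts_b:   (Nb, 3) homogeneous pixel coordinates
    F_cand:  (3, 3) candidate fundamental matrix being scored
    """
    # Epipolar line in B for each point x_i in A: l_i = F x_i  ->  (Na, 3)
    lines = pts_a @ F_cand.T
    # Point-to-line distance |l_i . x'_j| / ||(a_i, b_i)|| for every pair (i, j).
    num = (lines @ pts_b.T).abs()                   # (Na, Nb)
    den = lines[:, :2].norm(dim=-1, keepdim=True)   # (Na, 1)
    dist = num / den.clamp(min=1e-8)
    # Standard attention scores, but positions far from the line are masked out.
    scores = feats_a @ feats_b.T / feats_a.shape[-1] ** 0.5
    scores = scores.masked_fill(dist > thresh, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    # Queries whose epipolar line misses image B entirely produce NaN rows;
    # zero them out so they contribute nothing.
    attn = torch.nan_to_num(attn)
    return attn @ feats_b                           # (Na, C)
```

If the candidate F is close to the true geometry, the attended positions contain the true correspondences and the output features look coherent; if it is wrong, the attention is restricted to the wrong part of image B.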

This generates a new feature volume that is fed into our last block. That block only needs to evaluate whether the attended features are geometrically consistent and coherent, and then assigns them a score.
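Conceptually, that final block can be as simple as pooling the attended feature volume and regressing a scalar. The layer sizes and pooling below are purely illustrative, not our actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical scoring head: pool the attended features and map them
# to a single fitness score for the F/E hypothesis with a small MLP.
score_head = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

attended = torch.randn(1024, 256)         # output of the epipolar cross-attention
score = score_head(attended.mean(dim=0))  # one score per hypothesis
```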

Hope this helps!

Thanks for the prompt clarification.