sukjunhwang/IFC

Code explanation


Hello,

First of all, great paper! I just have one question. Would you mind helping me understand why only the last feature map is used in the transformer? Aren't you losing information by discarding the others?

src, mask = features[-1].decompose()

Hi @cyrilzakka,

Thank you for your interest in our work.

As you noted, it would be better to use multiple levels of features, such as res2, res3, and res4 (strides of 4, 8, and 16, respectively).
However, we use only the res5 feature (stride of 32) because of the computational burden; this follows the architectural design of DETR.
If multiple levels are involved, the computation increases dramatically, because the cost of self-attention grows quadratically with the number of tokens.
Therefore, DETR uses only the res5 feature to alleviate this issue.
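
To give a rough sense of the quadratic growth, here is a small sketch that counts tokens per feature level and the size of a dense self-attention matrix. The 640x640 input size and the helper names `tokens_per_level` and `attention_pairs` are assumptions made for illustration. They are not taken from this repository, and the sketch ignores batch size, frames, heads, and channels.

```python
import math


def tokens_per_level(height, width, strides):
    """Number of spatial positions (tokens) in the feature map at each stride."""
    return {s: math.ceil(height / s) * math.ceil(width / s) for s in strides}


def attention_pairs(num_tokens):
    """Number of entries in a dense self-attention matrix over num_tokens tokens."""
    return num_tokens ** 2


# Hypothetical 640x640 input; strides 4, 8, 16, 32 correspond to res2..res5.
levels = tokens_per_level(640, 640, strides=(4, 8, 16, 32))

# Attention over res5 tokens only vs. over the tokens of all levels concatenated.
res5_only = attention_pairs(levels[32])
all_levels = attention_pairs(sum(levels.values()))
ratio = all_levels / res5_only

print(levels)
print(res5_only, all_levels, ratio)
```

Under these assumed numbers, res5 alone gives 400 tokens, while all four levels together give 34,000 tokens, so the dense attention matrix becomes 85^2 = 7,225 times larger.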

Recently, there have been many approaches that use multiple feature levels while keeping the cost affordable, and I suggest you refer to Mask2Former if interested.

Thank you :)

Thanks!