ubc-vision/COTR

Possible redundancy in the code


Hi, I noticed that when constructing the Transformer, you always return the intermediate decoder features at this line. However, after feeding them to the MLP for correspondence regression, you only take the prediction from the last layer at this line. So I guess you could set return_intermediate=False to save some memory/computation?
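To make this concrete, here is a minimal sketch of the decoder pattern I mean. COTR builds on DETR, so the names and the return_intermediate flag below are assumptions taken from the DETR codebase rather than COTR's exact code:

import copy
import torch
import torch.nn as nn

class TransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers, return_intermediate=False):
        super().__init__()
        # independent copies of the layer, as in DETR's _get_clones
        self.layers = nn.ModuleList(copy.deepcopy(decoder_layer) for _ in range(num_layers))
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory):
        output = tgt
        intermediate = []
        for layer in self.layers:
            output = layer(output, memory)
            if self.return_intermediate:
                intermediate.append(output)   # keeps every layer's output (and its graph) alive
        if self.return_intermediate:
            return torch.stack(intermediate)  # [num_layers, T, B, C]
        return output.unsqueeze(0)            # only the last layer's output

layer = nn.TransformerDecoderLayer(d_model=256, nhead=8)
decoder = TransformerDecoder(layer, num_layers=6, return_intermediate=True)
hs = decoder(torch.randn(100, 2, 256), torch.randn(64, 2, 256))  # [6, 100, 2, 256]
pred = hs[-1]  # if only this slice is used downstream, the other five were stored for nothing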

I have another (maybe stupid) question. In your positional embedding here, the order of x/y seems to be reversed? If I stack them with xy_embed = torch.stack([x_embed, y_embed], dim=-1) and print their values:

>>> xy_embed.shape
torch.Size([2, 8, 8, 2])  # this is fine
>>> xy_embed[0, 0, 0]
tensor([1., 1.])  # this is also fine
>>> xy_embed[0, 0, 1]
tensor([2., 1.])  # shouldn't it be [1., 2.]?
>>> xy_embed[0, 5, 1]
tensor([2., 6.])  # shouldn't it be [6., 2.]?

I find their values a bit confusing. Any insight here?
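For reference, the printout above can be reproduced with a DETR-style cumsum coordinate grid; that the embedding here builds its coordinates the way DETR's PositionEmbeddingSine does is my assumption:

import torch

not_mask = torch.ones(2, 8, 8)  # (batch, H, W), every position valid
y_embed = not_mask.cumsum(1)    # row coordinate, 1-indexed along H
x_embed = not_mask.cumsum(2)    # column coordinate, 1-indexed along W

xy_embed = torch.stack([x_embed, y_embed], dim=-1)
print(xy_embed[0, 0, 1])  # tensor([2., 1.])
print(xy_embed[0, 5, 1])  # tensor([2., 6.])

Under this reading each pair is (x, y) in image convention, i.e. (col+1, row+1), so it only looks swapped relative to the tensor's (row, col) index order.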

Hey, yes, it sounds like a good idea to set return_intermediate=False. I would appreciate it if you could create a PR.
I'm not sure about the positional encoding question; would a permute fix this ordering problem?
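Something along these lines is what I have in mind (a hypothetical sketch, not the actual COTR code; strictly it is a swap at stack time, or a flip of the last dimension, rather than a permute):

import torch

not_mask = torch.ones(2, 8, 8)
y_embed, x_embed = not_mask.cumsum(1), not_mask.cumsum(2)  # as in the sketch above

yx_embed = torch.stack([y_embed, x_embed], dim=-1)  # y first instead of x first
print(yx_embed[0, 5, 1])  # tensor([6., 2.]) -- the (row, col) ordering expected above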

For the positional encoding thing, I guess it's fine as long as it's consistent between training and testing :)