ubc-vision/COTR

question

Closed this issue · 1 comments

1. Can I understand it this way: is the value of C the pixel value of the matching point in the right image, or is it a relational matrix?
2. What does it mean that the correspondence map C has 2 channels?
3. I have a few more questions. Which layer of the network produces C, the transformer or the MLP, and what are its dimensions? Is it a 256-dimensional vector? What do the two channels mentioned in the reply refer to? Are the query points fed into the network 256-dimensional, with position coordinates in the range 0 to 1? Or are they 2-dimensional integer coordinates?

  1. Yes, the value of C is the pixel location of the matching point in the right image.
  2. A pixel location is represented as XY, so C has 2 channels.
  3. C is not a direct output of the network. COTR is a query-based network: it returns only one answer per query. I'm not sure I follow your question, but the demo code and the diagram in the paper should provide enough detail about the input and output. In general, you can think of the coordinate values as normalized to the range 0 to 1.
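
To make the answers above concrete, here is a minimal NumPy sketch of how a 2-channel correspondence map could be assembled from a network that answers one query at a time. It is not the COTR code: `toy_query_fn` is a hypothetical stand-in for the network, the helper names are invented for illustration, and dividing by the image width and height is only one possible normalization convention. Check the demo code for the convention COTR actually uses.

```python
import numpy as np


def normalize_xy(xy_pixels, width, height):
    """Map pixel coordinates (x, y) into [0, 1] (assumed convention: divide by image size)."""
    return np.asarray(xy_pixels, dtype=np.float64) / np.array([width, height], dtype=np.float64)


def denormalize_xy(xy_norm, width, height):
    """Map normalized coordinates back to pixel coordinates."""
    return np.asarray(xy_norm, dtype=np.float64) * np.array([width, height], dtype=np.float64)


def build_correspondence_map(query_fn, width, height):
    """Build C with shape (height, width, 2) by issuing one query per pixel.

    Channel 0 holds the x coordinate of the match and channel 1 holds the y
    coordinate, both normalized. C is assembled outside the network.
    """
    C = np.zeros((height, width, 2), dtype=np.float64)
    for y in range(height):
        for x in range(width):
            query = normalize_xy((x, y), width, height)  # one 2D query point
            C[y, x] = query_fn(query)                    # one 2D answer
    return C


def toy_query_fn(query):
    """Hypothetical stand-in for the network: shift each point by 0.25 in normalized x."""
    return np.clip(query + np.array([0.25, 0.0]), 0.0, 1.0)


C = build_correspondence_map(toy_query_fn, width=4, height=3)
print(C.shape)                                      # two channels: x and y
print(denormalize_xy(C[1, 2], width=4, height=3))   # match for pixel (x=2, y=1), in pixels
```

Each entry of `C` stores a location rather than an intensity, which is why the map needs two channels and why it is built from repeated queries instead of being read from a single layer.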