facebookresearch/OrienterNet

Potential score-to-map position offsets caused by the camera position not being at the center of the BEV

Closed · 1 comment

Hello, this is a very practical piece of work. I have a few questions about the source code:

  1. In `conv2d_fft_batchwise`, `kernel_padded` is zero-padded at the bottom-right corner to match the signal size. I am not clear on why the padding is applied at the bottom-right (see the first sketch after this list).

  2. My understanding is that `conv2d_fft_batchwise` exists for acceleration and corresponds to a convolution in the spatial domain. The resulting score measures the similarity between the center of the kernel and each position on the map. However, for the Bird's-Eye View (BEV) used as the kernel, the camera should sit at the bottom center of the BEV, not at the kernel's center. This means a score does not correspond directly to a map position but is shifted by a translation, and the translation becomes more complex when the kernel is rotated (a second sketch after this list makes the offset concrete). The score-to-map mapping used in the loss and at inference does not appear to account for this translation. Is the network expected to absorb it through learning, and if so, does that make learning harder? Thank you very much.
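Regarding question 1, here is a minimal sketch (not the repository's exact code; `fft_correlate2d` is a hypothetical name) of why bottom-right zero-padding is natural: it keeps the kernel anchored at index (0, 0) of the padded array, so after the inverse FFT the output at (i, j) scores the kernel against the signal patch whose top-left corner is (i, j), and cropping the valid region reproduces a plain spatial cross-correlation. Padding anywhere else would only circularly shift the result.

```python
# Minimal sketch, assuming 2D single-channel tensors; not the repository's code.
import torch
import torch.nn.functional as F

def fft_correlate2d(signal: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """signal: (H, W), kernel: (h, w) with h <= H, w <= W."""
    H, W = signal.shape
    h, w = kernel.shape
    # Pad the kernel at the bottom-right so it stays anchored at index (0, 0).
    kernel_padded = torch.zeros(H, W, dtype=signal.dtype)
    kernel_padded[:h, :w] = kernel
    sig_fr = torch.fft.rfft2(signal)
    ker_fr = torch.fft.rfft2(kernel_padded)
    # Conjugating the kernel spectrum turns circular convolution into correlation.
    out = torch.fft.irfft2(sig_fr * ker_fr.conj(), s=(H, W))
    # Crop to the 'valid' region: out[i, j] matches the kernel against the
    # signal patch whose top-left corner is (i, j).
    return out[: H - h + 1, : W - w + 1]

signal = torch.randn(16, 16)
kernel = torch.randn(5, 5)
ref = F.conv2d(signal[None, None], kernel[None, None]).squeeze()  # conv2d is correlation
assert torch.allclose(fft_correlate2d(signal, kernel), ref, atol=1e-4)
```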
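And to make the offset in question 2 concrete, a toy sketch under the assumption that the scores come from a 'valid' correlation like the one above (`cam_ij` and `camera_pixel_from_scores` are hypothetical names): if the camera sits at template pixel `cam_ij`, each score index must be shifted by `cam_ij` to recover the camera's map pixel, and rotating the raw BEV would move `cam_ij`, which is what makes the offset rotation-dependent unless the templates are resampled into a canvas centered on the camera.

```python
# Toy illustration only; all names are hypothetical.
import torch

def camera_pixel_from_scores(scores: torch.Tensor, cam_ij: tuple) -> tuple:
    """Map the argmax of a 'valid' correlation score map to the camera's map pixel.

    scores[i, j] aligns the template's top-left corner with map pixel (i, j);
    a camera at template pixel cam_ij therefore lands at (i + cam_ij[0], j + cam_ij[1]).
    """
    i, j = divmod(scores.flatten().argmax().item(), scores.shape[-1])
    return i + cam_ij[0], j + cam_ij[1]

scores = torch.randn(12, 12)  # e.g. the output of fft_correlate2d above
cam_ij = (4, 2)               # bottom-center pixel of a 5x5 template
print(camera_pixel_from_scores(scores, cam_ij))
```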

Update: in `TemplateSampler`, I noticed that the rotation center is set to the very bottom center of the BEV, so the camera position is handled consistently. The code is fine; I misunderstood it initially.
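For reference, a simplified analogue of that idea (not the repository's `TemplateSampler`; `rotate_about_bottom_center` is a hypothetical helper, and the camera is assumed to sit at the template's bottom-center point): rotating the BEV about its bottom center makes the assumed camera location a fixed point of the warp for every rotation, which removes the rotation-dependent offset discussed above.

```python
# Simplified sketch, assuming a (C, H, W) template with the camera at bottom-center.
import math
import torch
import torch.nn.functional as F

def rotate_about_bottom_center(bev: torch.Tensor, angle_deg: float) -> torch.Tensor:
    """Rotate a (C, H, W) template about its bottom-center point.

    In align_corners=True normalized coordinates that point is (x, y) = (0, 1),
    the assumed camera location.
    """
    a = math.radians(angle_deg)
    cos, sin = math.cos(a), math.sin(a)
    # affine_grid maps output coords to input coords:
    # input = R(-a) @ (output - pivot) + pivot, with pivot = (0, 1).
    theta = torch.tensor(
        [[cos, sin, -sin], [-sin, cos, 1.0 - cos]], dtype=bev.dtype
    )
    grid = F.affine_grid(theta[None], [1, *bev.shape], align_corners=True)
    # Areas rotated in from outside the template are filled with zeros.
    return F.grid_sample(bev[None], grid, align_corners=True)[0]

bev = torch.rand(8, 65, 65)  # odd size puts the pivot exactly on a pixel
rot = rotate_about_bottom_center(bev, 30.0)
# The assumed camera pixel (bottom row, center column) is a fixed point:
assert torch.allclose(rot[:, -1, 32], bev[:, -1, 32], atol=1e-5)
```

Because the pivot is a fixed point, a single constant offset (or none, if the sampling canvas is centered on the camera) relates score indices to map positions for all rotations.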