About the number of superpixels
jaycheney opened this issue · 4 comments
Hello, I would like to ask why the number of superpixels in the VoxelNet file you provided is 30. How many superpixels should I set if I use a pillar-based 3D backbone network?
Hi,
The reason behind this choice is that a pillar-based network produces a lower-resolution feature map than our MinkUNet, and therefore requires a coarser superpixel grid. This number of segments has not been subject to a thorough ablation, but early results showed that it behaves better than our SLIC with 150 segments.
I believe that our adaptation of SLidR to this BEV backbone is a bit crude and probably under-performing; note that it is not part of the published article. I am still investigating whether there is a better way of using this backbone, as there are reasons to believe that the interpolation part is inefficient, and if I find one, I will update the code in the future.
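For reference, here is a minimal sketch of what a coarser superpixel grid means in practice, using skimage's SLIC (the segmentation method used in SLidR). The image path and the compactness value are illustrative assumptions, not the repository's exact preprocessing:

```python
import numpy as np
from skimage.io import imread
from skimage.segmentation import slic

image = imread("camera_image.jpg")  # hypothetical nuScenes camera frame

# 150 segments: the setting used with the MinkUNet backbone.
superpixels_minkunet = slic(image, n_segments=150, compactness=10, start_label=0)

# 30 segments: a coarser grid to roughly match the lower-resolution BEV feature map.
superpixels_bev = slic(image, n_segments=30, compactness=10, start_label=0)

print(superpixels_bev.shape, superpixels_bev.max() + 1)  # label map and segment count
```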
Thank you very much! We're also trying to do that.
Dear author, have you considered that, if the 3D network produces BEV features, the image features could also be transferred to the BEV perspective, so that no interpolation would be needed?
Transferring features from the camera perspective to the BEV usually requires good depth estimation, which we don't have in self-supervision. That is why we instead transfer the BEV features into the camera frame, but that creates issues because there cannot be a unique association between a pillar of theoretically infinite height and a pixel.
In practice, we use the same point-to-pixel association as we did for semantic segmentation and pool it into superpixels, with interpolation to account for the low resolution of the BEV feature map.
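To make that last step concrete, here is a minimal PyTorch sketch of the idea, not the repository's actual code: BEV features are bilinearly interpolated at each point's (x, y) location, then averaged per superpixel. The function name, the `pc_range` extent, the assumption that the BEV map's width axis corresponds to x and its height axis to y, and all tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pool_bev_features_into_superpixels(bev_feats, points_xy, point_sp_labels,
                                       pc_range, num_superpixels):
    """
    bev_feats:        (C, H, W) low-resolution BEV feature map
    points_xy:        (N, 2) point coordinates in metres (x, y)
    point_sp_labels:  (N,) long tensor, superpixel index of the pixel each point projects to
    pc_range:         (x_min, y_min, x_max, y_max) extent covered by the BEV map
    """
    x_min, y_min, x_max, y_max = pc_range

    # Normalise point coordinates to [-1, 1] for grid_sample (x -> width, y -> height).
    gx = 2 * (points_xy[:, 0] - x_min) / (x_max - x_min) - 1
    gy = 2 * (points_xy[:, 1] - y_min) / (y_max - y_min) - 1
    grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)   # (1, 1, N, 2)

    # Bilinear interpolation to compensate for the coarse BEV resolution.
    point_feats = F.grid_sample(bev_feats.unsqueeze(0), grid,
                                align_corners=False)          # (1, C, 1, N)
    point_feats = point_feats.squeeze(0).squeeze(1).t()       # (N, C)

    # Average-pool the per-point features into their superpixels.
    C = point_feats.shape[1]
    sums = torch.zeros(num_superpixels, C).index_add_(0, point_sp_labels, point_feats)
    counts = torch.zeros(num_superpixels).index_add_(
        0, point_sp_labels, torch.ones_like(point_sp_labels, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)            # (num_superpixels, C)
```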