Pointcept/PointTransformerV2

Why does the first position encoding module use 48 channels instead of the classical 32?

hengyu2333 opened this issue · 4 comments

I have read Point Transformer V1 and V2, and I have two questions about V2:
Why does the MLP map the input features to 48 channels instead of the classical 32?
How are the leftover points handled in the downsampling module when the point count does not divide evenly?

Hi, thanks for your interest in our work. The answers are as follows:

  1. It is just a parameter adjustment, similar to Transformers for 2D images. Transformers are parameter-efficient compared with conv-based models, and scaling up the base channel width increases model capacity (see the sketch after this list).
  2. I'm sorry, I don't fully understand the question. The number of points is controlled by real-world grid sizes, and as mentioned in our paper, each grid size yields a stable approximate sampling rate. As in previous works, the initial number of points fed into the network is controlled during the data augmentation process (voxelization, crop).
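To illustrate the first point, here is a minimal sketch (not the Pointcept code; `point_embedding` is a hypothetical helper) of how the base channel width is just the output dimension of the initial point-embedding MLP, and how widening it from 32 to 48 grows capacity:

```python
# A minimal sketch of an initial point-embedding MLP; 48 vs. 32 is simply
# the `base_channels` hyperparameter, which then sets the width of every stage.
import torch.nn as nn


def point_embedding(in_channels: int = 6, base_channels: int = 48) -> nn.Module:
    """Map raw per-point features (e.g. xyz + rgb) to `base_channels` dims."""
    return nn.Sequential(
        nn.Linear(in_channels, base_channels),
        nn.BatchNorm1d(base_channels),
        nn.ReLU(inplace=True),
    )


# Capacity grows with width: later attention/MLP layers scale roughly
# quadratically in the channel dimension, so 48 buys a noticeably larger model.
for width in (32, 48):
    n_params = sum(p.numel() for p in point_embedding(base_channels=width).parameters())
    print(width, n_params)
```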

Thanks a lot. I did not express my second question clearly. For example, with 1024 points as input, the classical approach uses a 1/4 downsampling to get a 256-point sample set. But in your paper, the experiments use 1/6, so how is the 1024 / 6 case handled? Does the sample set contain 170 or 171 points if the input is 1024? Or is the input required to be a multiple of 6?

1/6 is an approximate estimate of the sampling rate after Grid Pooling. Unlike sampling-based pooling methods, which are controlled by a sampling rate, Grid Pooling is controlled by a grid size. For example, the base grid size during voxelization (I prefer to call it grid sampling) is 0.02 m, so the first-stage grid size is 0.06 m (the ×3 mentioned in the model setting). After Grid Pooling with this real-world grid size, we found that the sampling rate is also quite stable; that is the approximate sampling rate (~1/6) we report in the ablation.
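For concreteness, here is a minimal sketch of grid-based pooling, assuming a simple voxel-hash reduction; `grid_pool` is a hypothetical helper, not the Pointcept implementation. The output size is whatever number of voxels happens to be occupied, which is why the rate is only approximately 1/6 rather than a fixed 1024 // 6:

```python
import torch


def grid_pool(coords: torch.Tensor, feats: torch.Tensor, grid_size: float):
    """coords: (N, 3) coordinates in meters, feats: (N, C) point features."""
    # Assign each point to a voxel of side `grid_size`.
    voxel_idx = torch.floor(coords / grid_size).long()
    # One output point per occupied voxel; `inverse` maps points to voxels.
    unique_voxels, inverse = torch.unique(voxel_idx, dim=0, return_inverse=True)
    m = unique_voxels.shape[0]  # number of pooled points (data dependent)

    # Average coordinates within each voxel.
    pooled_coords = torch.zeros(m, 3).index_add_(0, inverse, coords)
    counts = torch.zeros(m).index_add_(0, inverse, torch.ones(coords.shape[0]))
    pooled_coords = pooled_coords / counts.unsqueeze(1)

    # Reduce features within each voxel (max here; the reduction used in the
    # actual codebase may differ).
    pooled_feats = torch.full((m, feats.shape[1]), float("-inf")).scatter_reduce(
        0, inverse.unsqueeze(1).expand(-1, feats.shape[1]), feats, reduce="amax"
    )
    return pooled_coords, pooled_feats


# With a 0.02 m grid-sampled input and a 0.06 m pooling grid, the observed
# reduction is roughly 1/6 on real scans, but the exact count varies per scene.
coords = torch.rand(1024, 3) * 0.5   # hypothetical point cloud in a 0.5 m cube
feats = torch.rand(1024, 48)
pc, pf = grid_pool(coords, feats, grid_size=0.06)
print(pc.shape[0])                    # data dependent, not exactly 1024 // 6
```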

Thank you very much for your answer.