qq456cvb/Point-Transformers

The functions in pointnet_util.py are completely implemented in Python without cuda?

Closed this issue · 9 comments

Description: in pointnet_util.py, functions such as query_ball_point and farthest_point_sample are implemented entirely in Python, without CUDA.

My question: are these functions as efficient as CUDA implementations such as the pointops library?
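
For context, a pure-PyTorch farthest point sampling routine typically looks like the sketch below. This is an illustration rather than the exact code in pointnet_util.py: it assumes a `(B, N, 3)` input and starts from index 0 so the result is deterministic, whereas other implementations may pick a random start point. The Python loop runs `npoint` sequential iterations, which is the part a fused CUDA kernel avoids.

```python
import torch


def farthest_point_sample(xyz: torch.Tensor, npoint: int) -> torch.Tensor:
    """Pick `npoint` indices per batch by iterative farthest point sampling.

    xyz: (B, N, 3) point coordinates. Returns (B, npoint) long indices.
    Assumption: sampling starts at index 0 in every batch element.
    """
    B, N, _ = xyz.shape
    device = xyz.device
    centroids = torch.zeros(B, npoint, dtype=torch.long, device=device)
    # Squared distance from every point to its nearest already-selected point.
    distance = torch.full((B, N), float("inf"), dtype=xyz.dtype, device=device)
    farthest = torch.zeros(B, dtype=torch.long, device=device)
    batch_idx = torch.arange(B, device=device)
    for i in range(npoint):  # sequential loop: one launch chain per sample
        centroids[:, i] = farthest
        centroid = xyz[batch_idx, farthest].unsqueeze(1)  # (B, 1, 3)
        dist = ((xyz - centroid) ** 2).sum(dim=-1)        # (B, N)
        distance = torch.minimum(distance, dist)
        farthest = distance.argmax(dim=-1)
    return centroids
```

Each iteration is vectorized over the batch and over all points, but the iterations themselves cannot be parallelized, so the run time grows with `npoint`.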

Not very efficient yet. On the S3DIS dataset, the kNN operator easily runs the GPU out of memory.

Thanks for your reply, @EricLina!
So it is better to use the CUDA implementations of operations such as FPS and kNN, isn't it? @qq456cvb

In my experiment, the program runs out of GPU memory when running torch.einsum(...) on 2048 points, so I had to decrease the batch size to 8.
However, that is not the best solution; we need to use the corresponding CUDA implementations.
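
As a possible stopgap short of a custom kernel, the neighbour search can be done in chunks of query points so that the full N×N distance matrix is never held in memory at once. The sketch below is not code from this repo: `knn_chunked` and its `chunk` parameter are hypothetical names, the input is assumed to be `(B, N, 3)`, and it only addresses the kNN distance matrix, not memory used elsewhere (for example in the `torch.einsum(...)` call).

```python
import torch


@torch.no_grad()  # neighbour indices do not need gradients
def knn_chunked(xyz: torch.Tensor, k: int, chunk: int = 512) -> torch.Tensor:
    """Return the indices of the k nearest neighbours of every point.

    xyz: (B, N, 3). Returns (B, N, k) long indices, nearest first.
    Each point is its own nearest neighbour (distance 0).
    Peak extra memory is about B * chunk * N distances instead of B * N * N.
    """
    B, N, _ = xyz.shape
    out = []
    for start in range(0, N, chunk):
        queries = xyz[:, start:start + chunk]   # (B, c, 3)
        dists = torch.cdist(queries, xyz)       # (B, c, N) Euclidean distances
        # Smallest k distances per query; topk returns them in sorted order.
        out.append(dists.topk(k, dim=-1, largest=False).indices)
    return torch.cat(out, dim=1)
```

A smaller `chunk` lowers peak memory at the cost of more loop iterations, so it trades speed for the ability to keep a larger batch size.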

Yes, I think it would be best to implement them with custom CUDA kernels. Maybe you can have a look at https://github.com/erikwijmans/Pointnet2_PyTorch.git, which implements grouping/interpolation with custom CUDA kernels.

@EricLina did you work on S3DIS data with this repo?

Right

Could you please elaborate on a few things:

  1. Did you use the code here directly for S3DIS, or did you make some changes?
  2. Which model did you use? Hengshuang's?
  3. Did you get accuracies similar to those reported in the paper?

Thanks!!

  1. I did not change the code. If you can't get it running, you had better check your dataset preparation.
  2. Yes, there are two models with the same name, Hengshuang's and Tsinghua's. I used Hengshuang's model.
  3. Using qq456cvb's code, I got 69.84 mIoU (training for 48 h on a single A30, batch_size 2).

For those who would like to use FPS but are not able to compile the custom CUDA kernel for it, you could try the FPS implementation from Deep Graph Library (DGL).

It is much faster than the pure Python FPS implementation from https://github.com/yanx27/Pointnet_Pointnet2_pytorch (~18x faster in my case), and it is still pure Python code.
