thu-coai/DA-Transformer

The speedup of using the cuda operation compared with PyTorch native operations.

Closed this issue · 3 comments

xwjim commented

Thanks for your great work. I am wondering about the speedup of the CUDA operations compared with PyTorch native operations. Also, is there a good tutorial for getting started with CUDA programming?

You can run python ./fs_plugins/custom_ops/dag_loss.py to test the speedup.

On my device (V100-32G), the output is

########### Forward Tuning #############
1: 0.001267 0.030202 23.84   # With hyperparameter setting 1, the cuda dag_loss forward
                             # takes 0.001267s and the pytorch dag_loss forward takes 0.030202s,
                             # a 23.84x speedup.
2: 0.001420 0.029709 20.92
3: 0.001889 0.029700 15.72
4: 0.002805 0.029724 10.60
Best Choice: 1
########### Backward Tuning #############
(1, 1): 0.000601 0.014568 24.25
(1, 2): 0.000603 0.014654 24.31
(1, 3): 0.000587 0.014575 24.84
(2, 1): 0.000579 0.014831 25.60
(2, 2): 0.000594 0.014661 24.68
(2, 3): 0.000572 0.014568 25.49
Best Choice: (2, 1)
########### Align Tuning #############
1: 0.000617 0.023303 37.75
2: 0.000724 0.023330 32.21
3: 0.000972 0.023621 24.31
4: 0.001419 0.023319 16.43
Best Choice: 1
########### Test Gather #############
0.011539 0.030941 2.68

So the CUDA operations achieve about a 20~30x speedup in dag_loss and dag_best_alignment, and about a 2x speedup in dag_logsoftmax_gather_inplace. However, the most time-consuming part is the Transformer architecture itself, so the overall training speedup is about 20%. You can use lightseq to accelerate the Transformer, which brings the overall training speedup to about 2~3x.
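If you want to reproduce this kind of timing comparison yourself, a minimal benchmarking sketch is below. It is not the code used in dag_loss.py; the dag_loss_cuda / dag_loss_torch names and their arguments are placeholders for the CUDA and native implementations. The important details are warming up the kernels first and synchronizing the device (here via CUDA events) so the measured time covers the actual kernel execution.

import torch

def benchmark(fn, *args, warmup=5, iters=20):
    # Warm-up runs trigger kernel loading/caching so they do not skew the timing.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0 / iters  # average seconds per call

# Placeholder usage; dag_loss_cuda / dag_loss_torch and their arguments are
# hypothetical names, not the actual functions in the repository.
# t_cuda = benchmark(dag_loss_cuda, match_scores, links, output_length, target_length)
# t_torch = benchmark(dag_loss_torch, match_scores, links, output_length, target_length)
# print(f"{t_cuda:.6f} {t_torch:.6f} {t_torch / t_cuda:.2f}")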

Actually, I implemented the CUDA operations mainly to save GPU memory, because the dynamic programming costs too much memory (about O(L^3), roughly 50% of the peak memory during training) and seriously limits the batch size.
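For a rough, hypothetical illustration of the scale (the sizes and the breakdown below are assumptions for illustration, not measurements from the repository), assume the native implementation lets autograd store the per-transition terms of each logsumexp for the backward pass, while a fused kernel only keeps the O(L^2) DP tables and recomputes the rest:

# Hypothetical sizes, for illustration only.
batch, tgt_len, upsample = 32, 100, 8
graph_len = upsample * tgt_len            # number of DAG vertices
bytes_per_float = 4                       # fp32

# O(L^2) per sample: the DP tables a fused kernel keeps for backward.
dp_tables = batch * tgt_len * graph_len * bytes_per_float

# O(L^3) per sample: intermediates a naive autograd implementation may store,
# e.g. the per-successor terms of each logsumexp over graph_len transitions.
autograd_intermediates = batch * tgt_len * graph_len * graph_len * bytes_per_float

print(f"DP tables:              {dp_tables / 2**30:.3f} GiB")               # ~0.010 GiB
print(f"autograd intermediates: {autograd_intermediates / 2**30:.2f} GiB")  # ~7.63 GiB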

I am not an expert in CUDA programming (and I think dag_loss can be further accelerated with a more careful implementation). A Chinese tutorial collection is https://zhuanlan.zhihu.com/p/346910129, which is where I started.

xwjim commented

Thanks a lot.