Time using CUDA is higher than on CPU
mr-yamraj opened this issue · 3 comments
Commands:
!git clone https://github.com/pytorch/extension-cpp
%cd extension-cpp/cpp/
!python setup.py install
%cd ../cuda
!python setup.py install
%cd ..
!python benchmark.py py -r 100000
!python benchmark.py cpp -r 100000
!python benchmark.py cuda -r 100000
!python benchmark.py py --cuda -r 100000
!python benchmark.py cpp --cuda -r 100000
!python benchmark.py cuda --cuda -r 100000
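As a side note, the same extensions can also be compiled just-in-time with torch.utils.cpp_extension.load instead of setup.py install. A minimal sketch, assuming the source layout of the extension-cpp repo (the path cpp/lltm.cpp is taken from that layout):

import torch
from torch.utils.cpp_extension import load

# JIT-compile the C++ extension on first call; recompiles only when sources change.
# The source path is an assumption based on the repository layout.
lltm_cpp = load(name="lltm_cpp", sources=["cpp/lltm.cpp"], verbose=True)
print(lltm_cpp)  # module exposing whatever functions the C++ file binds via pybind11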
The K80-enabled servers in Google Colab are not very fast, and their architecture (Kepler) is not as easy to work with as more recent ones like Turing or Pascal.
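A quick way to check which GPU Colab assigned you, using standard torch.cuda calls (a Kepler-class K80 reports compute capability (3, 7)):

import torch

# Name and compute capability of the current GPU.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))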
You can at least see that the pure CUDA implementation is the fastest.
If you want to see real benefits from using the C++ extension or pure Python with CUDA, you can try increasing the workload, e.g. with a larger batch size.
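Also, if you time anything by hand on top of benchmark.py, remember that CUDA kernels launch asynchronously, so you must synchronize before reading the clock or the numbers are meaningless. A minimal sketch, using torch.sigmoid as a stand-in workload rather than the LLTM op from this repo:

import time
import torch

x = torch.randn(64, 128, device="cuda")

# Warm-up so one-time CUDA initialization is not included in the measurement.
for _ in range(10):
    y = torch.sigmoid(x)

# Kernels launch asynchronously: synchronize before and after the timed region.
torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    y = torch.sigmoid(x)
torch.cuda.synchronize()
print("{:.3f} us per iteration".format((time.time() - start) / 1000 * 1e6))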
I tried with a batch size of 64 instead of 16, and here are my results:
!python benchmark.py py --cuda -b64 -r 1000
!python benchmark.py cpp --cuda -b64 -r 1000
!python benchmark.py cuda --cuda -b64 -r 1000
Output (the two numbers per pass are min/average):
Forward: 327.349/372.224 us | Backward 553.608/714.348 us
Forward: 257.730/307.776 us | Backward 931.501/1132.617 us
Forward: 201.941/245.529 us | Backward 471.115/645.371 us
and without the --cuda flag (running on CPU):
!python benchmark.py py -b64 -r 1000
!python benchmark.py cpp -b64 -r 1000
!python benchmark.py cuda -b64 -r 1000
Output:
Forward: 491.858/543.436 us | Backward 710.726/835.553 us
Forward: 420.809/493.803 us | Backward 897.169/1071.797 us
Forward: 202.179/240.434 us | Backward 483.751/624.685 us
Thank you very much, that really helps.