pytorch/extension-cpp

Time on CUDA is higher than on CPU

mr-yamraj opened this issue · 3 comments

I am running the same example on Google Colab, and when I checked the timings with the benchmark script, the CPU times were lower than the CUDA times, which is surprising. Can you please provide an explanation?
(screenshot of benchmark output attached, 2019-09-11)

Commands:
!git clone https://github.com/pytorch/extension-cpp
%cd extension-cpp/cpp/
!python setup.py install
%cd ../cuda
!python setup.py install
%cd ..
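
As a side note: instead of the two setup.py installs, the same extensions can also be JIT-compiled with torch.utils.cpp_extension.load. A minimal sketch, where the source paths assume the pytorch/extension-cpp checkout as the working directory:

from torch.utils.cpp_extension import load

# JIT-compile the C++ and CUDA variants (paths assume the repo layout).
lltm_cpp = load(name='lltm_cpp', sources=['cpp/lltm.cpp'], verbose=True)
lltm_cuda = load(
    name='lltm_cuda',
    sources=['cuda/lltm_cuda.cpp', 'cuda/lltm_cuda_kernel.cu'],
    verbose=True,
)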

!python benchmark.py py -r 100000
!python benchmark.py cpp -r 100000
!python benchmark.py cuda -r 100000

!python benchmark.py py --cuda -r 100000
!python benchmark.py cpp --cuda -r 100000
!python benchmark.py cuda --cuda -r 100000
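
A general note on timing CUDA code: kernel launches are asynchronous, so a wall-clock timer must synchronize with the GPU before reading the time, and the first calls pay one-off initialization costs. A minimal illustrative sketch of that pattern (not the repo's benchmark.py, which handles this itself):

import torch

device = torch.device('cuda')
x = torch.randn(16, 128, device=device)
w = torch.randn(128, 128, device=device)

# Warm-up: the first calls pay one-time CUDA init and launch costs.
for _ in range(10):
    x @ w

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(1000):
    x @ w
end.record()
torch.cuda.synchronize()  # wait for the GPU before reading the timer
print(f'{start.elapsed_time(end) / 1000:.3f} ms per iteration')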

The K80-enabled servers in Google Colab are not very fast, and their architecture (Kepler) is not as easy to use efficiently as more recent ones like Turing or Pascal.

You can at least see that the pure CUDA implementation is the fastest.
If you want to see real benefits from the C++ version or pure Python on CUDA, try increasing the workload, e.g. with a larger batch size.
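
To make the intuition concrete: per-launch overhead is roughly fixed, so it amortizes as the batch grows. A toy sketch (using matmul as a stand-in for the LLTM cell, with arbitrarily chosen sizes) that shows the CPU/CUDA gap shifting with batch size:

import time
import torch

def bench(device, batch, runs=100):
    x = torch.randn(batch, 512, device=device)
    w = torch.randn(512, 512, device=device)
    for _ in range(10):  # warm-up
        x @ w
    if device.type == 'cuda':
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(runs):
        x @ w
    if device.type == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / runs * 1e6  # us per run

for batch in (16, 64, 256, 1024):
    cpu = bench(torch.device('cpu'), batch)
    gpu = bench(torch.device('cuda'), batch)
    print(f'batch {batch:5d}: cpu {cpu:8.1f} us | cuda {gpu:8.1f} us')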

I tried with a batch size of 64 instead of 16, and here are my results:

!python benchmark.py py --cuda -b64 -r 1000
!python benchmark.py cpp --cuda -b64 -r 1000
!python benchmark.py cuda --cuda -b64 -r 1000

->

Forward: 327.349/372.224 us | Backward 553.608/714.348 us
Forward: 257.730/307.776 us | Backward 931.501/1132.617 us
Forward: 201.941/245.529 us | Backward 471.115/645.371 us

and

!python benchmark.py py -b64 -r 1000
!python benchmark.py cpp -b64 -r 1000
!python benchmark.py cuda -b64 -r 1000

->

Forward: 491.858/543.436 us | Backward 710.726/835.553 us
Forward: 420.809/493.803 us | Backward 897.169/1071.797 us
Forward: 202.179/240.434 us | Backward 483.751/624.685 us

Thank you very much, that really helps.