/PiOnGPU

Compute Pi on GPU with CUDA, using BBP algorithm

Primary LanguageC++

PiOnGPU
Compute Pi on GPU with CUDA using BBP algorithm.
I've done a fundamantal optmization, but shared memory seems that didn't behave as I wish.
I'm puzzled but still learning the new way to optimize it.

#######################################
## Software Environment
#######################################
Windows 10
Visual Studio 2013 Ultimate, and run in Release configuration
CUDA 7.5
20000 hex digits precision
Precision of timing is millisecond

#######################################
## Hardware Environment 1
#######################################
Brand: ThinkPad T450(Energy saving disabled)
ChipSet: Intel 9 series
CPU: Intel Core i5-5200U 2.20GHz
DRAM: DDR3 1600MHz 12GB

#######################################
## Information of nVIDIA GT940M
#######################################
CUDA Driver Version / Runtime Version          7.5 / 7.5
CUDA Capability Major/Minor version number:    5.0
Total amount of global memory:                 1024 MBytes (1073741824 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
GPU Max Clock rate:                            980 MHz (0.98 GHz)
Memory Clock rate:                             1001 Mhz
Memory Bus Width:                              64-bit
L2 Cache Size:                                 1048576 bytes
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
Warp Size:                                     32

#######################################
## Test Result on nVIDIA GT940M
#######################################
threads per blocak      1024            512             256
No optimization:        6940.083887     6430.943848     6300.090039
Constant memory only:   6938.662695     6429.580566     6330.754102
Shared memory only:     7187.065234     6897.159473     6290.993066
Both memory enabled:    7000.014746     6537.461328     6380.222168

(Cont.)
threads per blocak      128             64
No optimization:        6353.020801     6304.920996
Constant memory only:   6355.743457     6276.935156
Shared memory only:     6300.069043     6278.295605
Both memory enabled:    6357.507031     6329.923535


#######################################
## Hardware Environment 2
#######################################
Brand: HP 
ChipSet: Intel H57
CPU: Intel Core i5-760 2.80GHz
DRAM: DDR3 1600+1333MHz 8+2GB

#######################################
## Information of nVIDIA GTX750Ti
#######################################
CUDA Driver Version / Runtime Version          7.5 / 7.5
CUDA Capability Major/Minor version number:    5.0
Total amount of global memory:                 2048 MBytes (2147483648 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
GPU Max Clock rate:                            1189 MHz (1.19 GHz)
Memory Clock rate:                             2700 Mhz
Memory Bus Width:                              128-bit
L2 Cache Size:                                 2097152 bytes
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
Run time limit on kernels:                     Yes
Integrated GPU sharing Host Memory:            No
CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
Warp Size:                                     32  

#######################################
## Test Result on nVIDIA GTX750Ti
#######################################
threads per blocak      1024            512             256
No optimization:        3745.776123     3260.284180     3200.490479
Constant memory only:   3787.979980     3260.239746     3214.660400
Shared memory only:     3832.115479     3278.210693     3205.545166
Both memory enabled:    3830.820068     3293.498291     3182.395264

(Cont.)
threads per blocak      128             64
No optimization:        3062.325928     3045.280273
Constant memory only:   3063.468758     3045.000244
Shared memory only:     3195.489746     3134.976807
Both memory enabled:    3085.062988     3103.204834

#######################################
## Hardware Environment 3
#######################################
DIY PC
ChipSet: Intel Z170
CPU: Intel Core i7-6700k 4.0GHz
DRAM: DDR4 2133MHz 4*4GB

#######################################
## Information of nVIDIA GTX970
#######################################
CUDA Driver Version / Runtime Version          7.5 / 7.5
CUDA Capability Major/Minor version number:    5.2
Total amount of global memory:                 4096 MBytes (4294967296 bytes)
(13) Multiprocessors, (128) CUDA Cores/MP:     1664 CUDA Cores
GPU Max Clock rate:                            1342 MHz (1.34 GHz)
Memory Clock rate:                             3505 Mhz
Memory Bus Width:                              256-bit
L2 Cache Size:                                 1835008 bytes
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     Yes
Integrated GPU sharing Host Memory:            No
CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
Warp Size:                                     32

#######################################
## Test Result on nVIDIA GTX970
#######################################
threads per blocak      1024            512             256
No optimization:        2226.527588     1674.075073     1523.615967
Constant memory only:   2226.088135     1673.547241     1520.018677
Shared memory only:     2228.230225     1704.942993     1521.855469
Both memory enabled:    2227.388428     1671.705811     1521.243652

(Cont.)
threads per blocak      128             64
No optimization:        1463.327637     1460.334229
Constant memory only:   1464.966675     1462.262085
Shared memory only:     1463.972290     1461.706665
Both memory enabled:    1460.544312     1486.251953

*All the test Every result is an average of 5 repeat running.