PiOnGPU Compute Pi on GPU with CUDA using BBP algorithm. I've done a fundamantal optmization, but shared memory seems that didn't behave as I wish. I'm puzzled but still learning the new way to optimize it. ####################################### ## Software Environment ####################################### Windows 10 Visual Studio 2013 Ultimate, and run in Release configuration CUDA 7.5 20000 hex digits precision Precision of timing is millisecond ####################################### ## Hardware Environment 1 ####################################### Brand: ThinkPad T450(Energy saving disabled) ChipSet: Intel 9 series CPU: Intel Core i5-5200U 2.20GHz DRAM: DDR3 1600MHz 12GB ####################################### ## Information of nVIDIA GT940M ####################################### CUDA Driver Version / Runtime Version 7.5 / 7.5 CUDA Capability Major/Minor version number: 5.0 Total amount of global memory: 1024 MBytes (1073741824 bytes) ( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores GPU Max Clock rate: 980 MHz (0.98 GHz) Memory Clock rate: 1001 Mhz Memory Bus Width: 64-bit L2 Cache Size: 1048576 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Concurrent copy and kernel execution: Yes with 1 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model) Warp Size: 32 ####################################### ## Test Result on nVIDIA GT940M ####################################### threads per blocak 1024 512 256 No optimization: 6940.083887 6430.943848 6300.090039 Constant memory only: 6938.662695 6429.580566 6330.754102 Shared memory only: 7187.065234 6897.159473 6290.993066 Both memory enabled: 7000.014746 6537.461328 6380.222168 (Cont.) threads per blocak 128 64 No optimization: 6353.020801 6304.920996 Constant memory only: 6355.743457 6276.935156 Shared memory only: 6300.069043 6278.295605 Both memory enabled: 6357.507031 6329.923535 ####################################### ## Hardware Environment 2 ####################################### Brand: HP ChipSet: Intel H57 CPU: Intel Core i5-760 2.80GHz DRAM: DDR3 1600+1333MHz 8+2GB ####################################### ## Information of nVIDIA GTX750Ti ####################################### CUDA Driver Version / Runtime Version 7.5 / 7.5 CUDA Capability Major/Minor version number: 5.0 Total amount of global memory: 2048 MBytes (2147483648 bytes) ( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores GPU Max Clock rate: 1189 MHz (1.19 GHz) Memory Clock rate: 2700 Mhz Memory Bus Width: 128-bit L2 Cache Size: 2097152 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Concurrent copy and kernel execution: Yes with 1 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model) Warp Size: 32 ####################################### ## Test Result on nVIDIA GTX750Ti ####################################### threads per blocak 1024 512 256 No optimization: 3745.776123 3260.284180 3200.490479 Constant memory only: 3787.979980 3260.239746 3214.660400 Shared memory only: 3832.115479 3278.210693 3205.545166 Both memory enabled: 3830.820068 3293.498291 3182.395264 (Cont.) threads per blocak 128 64 No optimization: 3062.325928 3045.280273 Constant memory only: 3063.468758 3045.000244 Shared memory only: 3195.489746 3134.976807 Both memory enabled: 3085.062988 3103.204834 ####################################### ## Hardware Environment 3 ####################################### DIY PC ChipSet: Intel Z170 CPU: Intel Core i7-6700k 4.0GHz DRAM: DDR4 2133MHz 4*4GB ####################################### ## Information of nVIDIA GTX970 ####################################### CUDA Driver Version / Runtime Version 7.5 / 7.5 CUDA Capability Major/Minor version number: 5.2 Total amount of global memory: 4096 MBytes (4294967296 bytes) (13) Multiprocessors, (128) CUDA Cores/MP: 1664 CUDA Cores GPU Max Clock rate: 1342 MHz (1.34 GHz) Memory Clock rate: 3505 Mhz Memory Bus Width: 256-bit L2 Cache Size: 1835008 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model) Warp Size: 32 ####################################### ## Test Result on nVIDIA GTX970 ####################################### threads per blocak 1024 512 256 No optimization: 2226.527588 1674.075073 1523.615967 Constant memory only: 2226.088135 1673.547241 1520.018677 Shared memory only: 2228.230225 1704.942993 1521.855469 Both memory enabled: 2227.388428 1671.705811 1521.243652 (Cont.) threads per blocak 128 64 No optimization: 1463.327637 1460.334229 Constant memory only: 1464.966675 1462.262085 Shared memory only: 1463.972290 1461.706665 Both memory enabled: 1460.544312 1486.251953 *All the test Every result is an average of 5 repeat running.