This repository contains basic CUDA implementations for beginners who want complete, runnable code. You can change the input, block, and grid sizes and see how CUDA distributes threads across SMs, which helps with debugging and with understanding the structure of CUDA. The steps are explained below:
The code in create_inputs.cpp writes random floating-point numbers as a 1D array to two separate files, inp1.txt and inp2.txt. The first file is used for simple vector addition, and both files are used as inputs to matrix multiplication. Build create_inputs and pass the number of elements as a command-line argument. Example:
$ g++ -O3 -o create_inputs create_inputs.cpp
$ ./create_inputs 16
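For readers who want to see what such a generator involves, here is a minimal C++ sketch. It is not the repository's create_inputs.cpp: the helper names, the [-10, 10] value range (guessed from the sample output further down), the fixed seed, and the one-value-per-line layout are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: n pseudo-random floats between lo and hi.
// The range and the explicit seed are assumptions, not taken from the repo.
std::vector<float> make_random_vector(std::size_t n, unsigned seed,
                                      float lo = -10.0f, float hi = 10.0f) {
    assert(lo < hi);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(lo, hi);
    std::vector<float> v(n);
    for (float &x : v) x = dist(gen);
    return v;
}

// Hypothetical helper: renders the vector as text, one value per line
// (an assumed layout; the real input files may be formatted differently).
std::string format_vector(const std::vector<float> &v) {
    std::ostringstream out;
    for (float x : v) out << x << '\n';
    return out.str();
}

// Hypothetical helper: writes the text to a file, returns false on failure.
bool write_file(const std::string &path, const std::string &text) {
    std::ofstream out(path);
    if (!out) return false;
    out << text;
    return static_cast<bool>(out);
}
```

Under these assumptions, calling write_file("inp1.txt", format_vector(make_random_vector(16, 1))) and repeating it with a different seed for inp2.txt would produce two 16-element inputs.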
The code in addition.cu reads inp1.txt and performs a simple vector addition on the GPU. Build the code and pass the block size as a command-line argument. Example:
$ nvcc -o addition -Xptxas -O3 addition.cu
$ ./addition 16
--------------------------------------------------
Device Number: 0
Device name: NVIDIA GeForce RTX 3080
SM count: 68
Cores per SM: 128
Max Blocks per SM: 16
Max Threads per Block: 1024
Max Threads per SM: 1536
--------------------------------------------------
Read 16 elements from inp1.txt
SM(0) | Block(0,0,0) | Thread(0,0,0) | TID(0) -> -9.25 + 1
SM(0) | Block(0,0,0) | Thread(1,0,0) | TID(1) -> 8.08 + 1
SM(0) | Block(0,0,0) | Thread(2,0,0) | TID(2) -> 1.30 + 1
SM(0) | Block(0,0,0) | Thread(3,0,0) | TID(3) -> -6.62 + 1
SM(0) | Block(0,0,0) | Thread(4,0,0) | TID(4) -> -7.49 + 1
SM(0) | Block(0,0,0) | Thread(5,0,0) | TID(5) -> 2.22 + 1
SM(0) | Block(0,0,0) | Thread(6,0,0) | TID(6) -> 2.01 + 1
SM(0) | Block(0,0,0) | Thread(7,0,0) | TID(7) -> 7.18 + 1
SM(0) | Block(0,0,0) | Thread(8,0,0) | TID(8) -> -5.34 + 1
SM(0) | Block(0,0,0) | Thread(9,0,0) | TID(9) -> -2.97 + 1
SM(0) | Block(0,0,0) | Thread(10,0,0) | TID(10) -> -0.16 + 1
SM(0) | Block(0,0,0) | Thread(11,0,0) | TID(11) -> 1.12 + 1
SM(0) | Block(0,0,0) | Thread(12,0,0) | TID(12) -> 6.48 + 1
SM(0) | Block(0,0,0) | Thread(13,0,0) | TID(13) -> 9.48 + 1
SM(0) | Block(0,0,0) | Thread(14,0,0) | TID(14) -> -8.82 + 1
SM(0) | Block(0,0,0) | Thread(15,0,0) | TID(15) -> -1.27 + 1
addABC elapsed time : 0.912384 ms
SM map:
0| 16 0 0 0 0 0 0 0
8| 0 0 0 0 0 0 0 0
16| 0 0 0 0 0 0 0 0
24| 0 0 0 0 0 0 0 0
32| 0 0 0 0 0 0 0 0
40| 0 0 0 0 0 0 0 0
48| 0 0 0 0 0 0 0 0
56| 0 0 0 0 0 0 0 0
64| 0 0 0 0
Done
$ ./addition 8
--------------------------------------------------
Device Number: 0
Device name: NVIDIA GeForce RTX 3080
SM count: 68
Cores per SM: 128
Max Blocks per SM: 16
Max Threads per Block: 1024
Max Threads per SM: 1536
--------------------------------------------------
Read 16 elements from inp1.txt
SM(2) | Block(1,0,0) | Thread(0,0,0) | TID(8) -> -5.34 + 1
SM(2) | Block(1,0,0) | Thread(1,0,0) | TID(9) -> -2.97 + 1
SM(2) | Block(1,0,0) | Thread(2,0,0) | TID(10) -> -0.16 + 1
SM(2) | Block(1,0,0) | Thread(3,0,0) | TID(11) -> 1.12 + 1
SM(2) | Block(1,0,0) | Thread(4,0,0) | TID(12) -> 6.48 + 1
SM(2) | Block(1,0,0) | Thread(5,0,0) | TID(13) -> 9.48 + 1
SM(2) | Block(1,0,0) | Thread(6,0,0) | TID(14) -> -8.82 + 1
SM(2) | Block(1,0,0) | Thread(7,0,0) | TID(15) -> -1.27 + 1
SM(0) | Block(0,0,0) | Thread(0,0,0) | TID(0) -> -9.25 + 1
SM(0) | Block(0,0,0) | Thread(1,0,0) | TID(1) -> 8.08 + 1
SM(0) | Block(0,0,0) | Thread(2,0,0) | TID(2) -> 1.30 + 1
SM(0) | Block(0,0,0) | Thread(3,0,0) | TID(3) -> -6.62 + 1
SM(0) | Block(0,0,0) | Thread(4,0,0) | TID(4) -> -7.49 + 1
SM(0) | Block(0,0,0) | Thread(5,0,0) | TID(5) -> 2.22 + 1
SM(0) | Block(0,0,0) | Thread(6,0,0) | TID(6) -> 2.01 + 1
SM(0) | Block(0,0,0) | Thread(7,0,0) | TID(7) -> 7.18 + 1
addABC elapsed time : 0.903168 ms
SM map:
0| 8 0 8 0 0 0 0 0
8| 0 0 0 0 0 0 0 0
16| 0 0 0 0 0 0 0 0
24| 0 0 0 0 0 0 0 0
32| 0 0 0 0 0 0 0 0
40| 0 0 0 0 0 0 0 0
48| 0 0 0 0 0 0 0 0
56| 0 0 0 0 0 0 0 0
64| 0 0 0 0
Done
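The Block, Thread, and TID columns in the two runs above are consistent with the usual 1D indexing scheme. The host-side C++ sketch below reproduces that arithmetic without a GPU. The function names are hypothetical, and it assumes that addition.cu rounds the grid size up and computes the global index as blockIdx.x * blockDim.x + threadIdx.x, which the source does not show.

```cpp
#include <cassert>
#include <cstdio>

// Number of blocks needed to cover n elements (ceiling division).
unsigned blocks_for(unsigned n, unsigned block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Global 1D index: the host-side analogue of
// blockIdx.x * blockDim.x + threadIdx.x inside a kernel.
unsigned global_tid(unsigned block_idx, unsigned block_dim, unsigned thread_idx) {
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

// Prints the Block/Thread/TID triples that a 1D launch would enumerate.
void print_mapping(unsigned n, unsigned block_size) {
    const unsigned blocks = blocks_for(n, block_size);
    for (unsigned b = 0; b < blocks; ++b) {
        for (unsigned t = 0; t < block_size; ++t) {
            const unsigned tid = global_tid(b, block_size, t);
            if (tid < n)  // guard for a partially filled last block
                std::printf("Block(%u,0,0) | Thread(%u,0,0) | TID(%u)\n", b, t, tid);
        }
    }
}
```

With 16 elements, blocks_for(16, 16) gives one block and blocks_for(16, 8) gives two, and global_tid(1, 8, 0) is 8, matching the first line printed for Block(1,0,0). The SM column cannot be derived this way, because the hardware scheduler decides where each block runs.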
The above commands read a vector with 16 elements and perform the addition with block sizes of 16 and 8. In the first command, a single block of 16 threads is created and scheduled on SM(0). In the second command, two blocks of 8 threads each are created and scheduled on SM(0) and SM(2).
Note that to compile for a specific architecture, you can add -arch=sm_XX. See NVIDIA's compute capability documentation to find the SM number that corresponds to an architecture.
As another run, the following outputs show the SM usage map for 16K elements with block sizes of 4, 128, and 1024.
$ ./create_inputs 16384
$ ./addition 4
...
addABC elapsed time : 0.01536 ms
SM map:
0| 240 228 240 224 236 228 236 224
8| 280 264 268 256 244 224 236 228
16| 236 228 240 216 272 276 272 256
24| 236 228 232 228 224 228 224 228
32| 264 256 284 268 232 228 228 232
40| 232 224 232 224 272 276 288 272
48| 232 232 228 236 232 224 232 224
56| 260 240 276 256 240 216 240 220
64| 232 220 232 220
Done
$ ./addition 128
...
addABC elapsed time : 0.012288 ms
SM map:
0| 256 256 256 256 256 256 256 256
8| 256 256 256 256 256 256 256 256
16| 256 256 256 256 256 256 256 256
24| 256 256 256 256 256 256 256 256
32| 256 256 256 256 256 256 256 256
40| 256 256 256 256 256 256 256 256
48| 256 256 256 256 256 128 256 128
56| 256 128 256 128 256 128 256 128
64| 256 128 256 128
Done
$ ./addition 1024
...
addABC elapsed time : 0.012288 ms
SM map:
0| 1024 0 1024 0 1024 0 1024 0
8| 1024 0 1024 0 1024 0 1024 0
16| 1024 0 1024 0 1024 0 1024 0
24| 1024 0 1024 0 1024 0 1024 0
32| 0 0 0 0 0 0 0 0
40| 0 0 0 0 0 0 0 0
48| 0 0 0 0 0 0 0 0
56| 0 0 0 0 0 0 0 0
64| 0 0 0 0
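The SM maps above count how many threads ran on each SM, printed eight SMs per row. As a rough C++ illustration of how such a map could be tallied and formatted on the host, consider the sketch below. It assumes each thread has recorded the id of the SM it ran on into an array; how addition.cu actually collects this is not shown in this section, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Counts how many threads reported each SM id.
// sm_of_thread[i] is the SM id recorded by thread i (assumed input).
std::vector<unsigned> tally_sm_map(const std::vector<unsigned> &sm_of_thread,
                                   std::size_t sm_count) {
    std::vector<unsigned> map(sm_count, 0);
    for (unsigned sm : sm_of_thread) {
        assert(sm < sm_count);
        ++map[sm];
    }
    return map;
}

// Formats the map eight SMs per row; each row starts with the index
// of its first SM, mirroring the layout of the output above.
std::string format_sm_map(const std::vector<unsigned> &map) {
    std::ostringstream out;
    for (std::size_t i = 0; i < map.size(); ++i) {
        if (i % 8 == 0) out << i << '|';
        out << ' ' << map[i];
        if (i % 8 == 7 || i + 1 == map.size()) out << '\n';
    }
    return out.str();
}
```

The entries of a map always sum to the number of launched threads, which is a quick sanity check. In the run with block size 1024, 16384 / 1024 = 16 blocks are created, so only 16 SMs show a non-zero count. In the run with block size 128, the 128 blocks spread over all 68 SMs: 60 SMs hold two blocks (256 threads) and 8 hold one (128 threads).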