CUDA programming C++

The most common deep learning frameworks, such as TensorFlow and PyTorch, rely on GPU kernel calls to run computations in parallel and accelerate the computation of neural networks. The best-known interface for programming the GPU is CUDA, created by NVIDIA. This repository tracks my progress in this area. It is based mainly on what I am learning step by step in my master's program in deep learning run by Deep Learning Italia Academy, on the Udemy course CUDA Programming Masterclass with C++, and of course on the NVIDIA documentation.

My goal is to deepen my knowledge of parallel programming!

parallel_cube

In this repository:

  • Hello World

    I learned key concepts such as host (CPU) and device (GPU) computation, context switching, and the apparent parallel execution on a CPU. I also covered the difference between a process and a thread, how threads share memory, the two levels of parallelism, (1) task level and (2) data level, and the difference between parallelism and concurrency. Finally, I launched a kernel using the grid and block parameters.

  • Threads Organization

    Figuring out which threads execute the kernel function, and how, is often difficult. I learned to use the built-in variables threadIdx, blockIdx, blockDim, and gridDim to identify them.

  • Unique Index Calculation

    Identifying unique thread IDs can be difficult, especially when using two- or even three-dimensional grids and blocks. Here I solve this problem.

  • Memory Transfer

    In addition to processing data on the GPU, we also need to transfer data from the CPU to the GPU, and transfer the results back.

  • Sum Array

    Let's transfer two arrays to the GPU and sum them there. We measure the time needed using clocks, and handle CUDA errors by creating a macro that wraps every CUDA function call.

  • Device Query

    Here is a simple program to query our device on the fly and get its properties.

  • Intro to Warps

    We should consider how the software view of threads maps onto the hardware. An SM schedules and executes threads in groups of 32 called warps, so the number of threads in a block should be a multiple of 32. If we launch a block with a single thread, the hardware still allocates a full warp with resources for 32 threads, but 31 of them are inactive, which wastes resources.

  • Warp Divergence

    Warp divergence is an issue for parallel computing. When threads of the same warp take different branches, the branches run one after another with part of the warp disabled, so part of the NVIDIA SM sits idle and resources are wasted. Pay attention to if-else statements. You can check the branch_efficiency metric by compiling with nvcc and running nvprof.
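
As an illustration of the Unique Index Calculation item above, here is a minimal sketch of one common way to derive a unique linear index in a 2D grid of 2D blocks. The kernel name `write_unique_ids`, the launch shape, and the block-after-block ordering are assumptions made for this example, not necessarily what the repository code does.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes a unique linear index and stores it at that position.
// Ordering assumed here: all threads of block 0, then all threads of block 1, ...
__global__ void write_unique_ids(int *out)
{
    // Number of threads in one block and this thread's offset inside its block.
    int threads_per_block = blockDim.x * blockDim.y;
    int tid_in_block = threadIdx.y * blockDim.x + threadIdx.x;

    // Linear position of this block inside the grid.
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;

    // Skip the threads of all preceding blocks, then add the local offset.
    int gid = block_id * threads_per_block + tid_in_block;
    out[gid] = gid;
}

int main()
{
    dim3 block(4, 2);  // 8 threads per block (illustrative shape)
    dim3 grid(2, 2);   // 4 blocks
    const int n = static_cast<int>(block.x * block.y * grid.x * grid.y);

    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    write_unique_ids<<<grid, block>>>(d_out);

    // cudaMemcpy waits for the kernel on the default stream before copying.
    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // If every index is unique and dense, slot i holds the value i.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == i);
    printf("%s\n", ok ? "all indices unique" : "index mismatch");
    return ok ? 0 : 1;
}
```

The same pattern extends to three dimensions by adding the `z` terms to both the in-block offset and the block position.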
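
The Memory Transfer and Sum Array items above can be sketched together as follows. This is a hedged example under stated assumptions: the macro name `CUDA_CHECK`, the kernel name `sum_arrays`, the array size, and the block size of 128 are all chosen for illustration and may differ from the repository code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// One thread per element; the guard handles sizes that do not fill the grid.
__global__ void sum_arrays(const int *a, const int *b, int *c, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) c[gid] = a[gid] + b[gid];
}

int main()
{
    const int n = 1 << 20;  // illustrative size
    const size_t bytes = n * sizeof(int);
    std::vector<int> h_a(n), h_b(n), h_c(n);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    // Host to device transfer.
    clock_t t0 = clock();
    CUDA_CHECK(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice));
    clock_t t1 = clock();

    // Kernel launch: check the launch itself, then wait for completion.
    const int block = 128;
    const int grid = (n + block - 1) / block;
    sum_arrays<<<grid, block>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    clock_t t2 = clock();

    // Device to host transfer of the result.
    CUDA_CHECK(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost));
    clock_t t3 = clock();

    // Compare against the sum computed on the host.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_c[i] == h_a[i] + h_b[i]);

    // clock() reports host CPU time, so these numbers are only a rough guide.
    printf("host to device: %.6f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
    printf("kernel:         %.6f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
    printf("device to host: %.6f s\n", double(t3 - t2) / CLOCKS_PER_SEC);
    printf("result %s\n", ok ? "matches" : "differs");

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    return ok ? 0 : 1;
}
```

Kernel launches do not return an error code directly, which is why the sketch calls `cudaGetLastError()` right after the launch and wraps `cudaDeviceSynchronize()` as well.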
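
For the Device Query item above, a minimal sketch could look like the following. It uses `cudaGetDeviceCount` and `cudaGetDeviceProperties` from the CUDA runtime API; the selection of properties printed is an assumption for illustration, since `cudaDeviceProp` has many more fields.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA device available (%s)\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
        printf("  Global memory:           %.0f MiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0));
        printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Warp size:               %d\n", prop.warpSize);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:    %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:     %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}
```

The warp size and the maximum block dimensions reported here are the limits that the thread organization and warp items above have to respect.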
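
To make the last item above concrete, here is a hedged sketch of two kernels that do the same work but branch differently. The kernel names `no_divergence` and `divergence` and the constants are made up for this example. In the first kernel the condition depends on the warp index, so all threads of a warp take the same path; in the second it depends on thread parity, so even and odd threads of the same warp take different paths.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Condition is uniform inside a warp: whole warps go one way or the other.
__global__ void no_divergence(float *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = gid / warpSize;
    float a, b;
    if (warp_id % 2 == 0) { a = 100.0f; b = 50.0f; }
    else                  { a = 200.0f; b = 75.0f; }
    out[gid] = a + b;  // store the result so the branch is not optimized away
}

// Condition alternates between neighbouring threads: each warp splits in two.
__global__ void divergence(float *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    if (gid % 2 == 0) { a = 100.0f; b = 50.0f; }
    else              { a = 200.0f; b = 75.0f; }
    out[gid] = a + b;
}

int main()
{
    const int n = 1 << 22;    // illustrative size, a multiple of the block size
    const int block = 128;    // a multiple of the warp size
    const int grid = n / block;

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return 1;

    no_divergence<<<grid, block>>>(d_out);
    cudaDeviceSynchronize();

    divergence<<<grid, block>>>(d_out);
    cudaDeviceSynchronize();

    cudaError_t err = cudaGetLastError();
    cudaFree(d_out);
    printf("%s\n", err == cudaSuccess ? "kernels finished" : cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

One caveat: for branches this short, the optimizing compiler may emit predicated instructions instead of real branches, so both kernels can report the same branch efficiency. Compiling without device optimizations, for example `nvcc -G divergence.cu -o divergence` (the file name is hypothetical), and then running `nvprof --metrics branch_efficiency ./divergence` is one way to see the difference on GPUs that nvprof supports.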