The GPU architecture has many blocks which in turn contains multiple threads which are capable of executing operations in parallel.
GPU is optimized for throughput, but not necessarily for latency.
Each GPU core is slow but there are thousands of it.
GPU works well for massively parallel tasks such as matrix multiplication, but it can be quite inefficient for tasks where massive parallelization is impossible or difficult.
These are the main steps to run you programme on parallel threads of GPU
- Initate The Input Data in HOST(CPU)
- Allocate Memory on Device(GPU) for input and output variables
- Copy the Input data from HOST to DEVICE
- Launch a kernel (call the GPU code)
- Copy the output from DEVICE to HOST
- FREE the Allocated memory on GPU
Programme | Links |
---|---|
Square of numbers | Click Here |
Adding Vectors | Click Here |
Barrier Synchronisation | Click Here |
Vector Multiplication | Click Here |