pjyi2147/GPU_Computing_lab

CUDA labs and a final project that parallelizes the Quick Hull algorithm on the GPU.

C++

GPU Computing Labs

Preface

Exercises on programming NVIDIA GPUs with CUDA and on optimization methods specific to GPU-parallel computing.

How to run?

Each chapter folder contains a readme.md file that explains how to build and run the template files for that lab.

Topics

A1

Use CUDA APIs to implement vector addition
Learn the basics of GPU memory allocation and data transfer between the CPU and GPU

A2

Implement tiled dense matrix multiplication using CUDA
Learn how to allocate memory on the GPU and transfer data between the GPU and CPU
Use shared memory to speed up computation and reduce memory latency
Measure the performance difference with and without shared memory
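
To illustrate the tiling idea behind A2, here is a minimal CPU-side sketch in plain C++. It is not the lab's CUDA kernel. The function name `tiledMatMul` and the tile width `kTile` are hypothetical, and matrices are assumed to be row-major. In a CUDA version, each tile of `A` and `B` would typically be staged in shared memory by one thread block before it is reused.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 16;  // hypothetical tile width, not taken from the lab

// Computes C (m x n) = A (m x k) * B (k x n); all matrices are row-major.
std::vector<float> tiledMatMul(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<float> C(m * n, 0.0f);
    // Walk over tiles of the output (i0, j0) and of the shared dimension (p0).
    for (std::size_t i0 = 0; i0 < m; i0 += kTile) {
        for (std::size_t j0 = 0; j0 < n; j0 += kTile) {
            for (std::size_t p0 = 0; p0 < k; p0 += kTile) {
                // Clamp tile bounds so sizes need not be multiples of kTile.
                const std::size_t iEnd = std::min(i0 + kTile, m);
                const std::size_t jEnd = std::min(j0 + kTile, n);
                const std::size_t pEnd = std::min(p0 + kTile, k);
                // Accumulate this tile pair's partial products into C.
                for (std::size_t i = i0; i < iEnd; ++i) {
                    for (std::size_t p = p0; p < pEnd; ++p) {
                        const float a = A[i * k + p];
                        for (std::size_t j = j0; j < jEnd; ++j) {
                            C[i * n + j] += a * B[p * n + j];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

Each element of a tile is reused up to `kTile` times once it has been loaded, which is the effect that shared memory exploits on the GPU.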

A3

Implement vector addition using pinned memory with CUDA streams
Benchmark the performance gain from using CUDA streams to hide memory latency

A4

Apply convolution to a PPM image using CUDA APIs
Find the overhead of the output tiling algorithm
Evaluate the performance of the output tiling algorithm with different tile sizes, including extreme ones
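
As a reference point for A4, the following is a minimal CPU sketch of the per-pixel computation, written in plain C++ and not taken from the lab. It assumes a single-channel image, an odd square mask, and zero-valued cells outside the image; the name `convolve2D` is hypothetical. Like most CUDA teaching examples, it applies the mask without flipping it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convolves a row-major, single-channel image with a maskWidth x maskWidth mask.
// Pixels outside the image are treated as zero.
std::vector<float> convolve2D(const std::vector<float>& image, int height, int width,
                              const std::vector<float>& mask, int maskWidth) {
    assert(maskWidth % 2 == 1);
    assert(image.size() == static_cast<std::size_t>(height) * width);
    assert(mask.size() == static_cast<std::size_t>(maskWidth) * maskWidth);
    const int radius = maskWidth / 2;
    std::vector<float> out(image.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int yy = y + dy;
                    const int xx = x + dx;
                    // Skip neighbours that fall outside the image.
                    if (yy >= 0 && yy < height && xx >= 0 && xx < width) {
                        acc += image[yy * width + xx] *
                               mask[(dy + radius) * maskWidth + (dx + radius)];
                    }
                }
            }
            out[y * width + x] = acc;
        }
    }
    return out;
}
```

In an output-tiled kernel, each thread block computes one tile of `out` and must also load a halo of `radius` pixels around that tile, so the halo grows relative to the useful work as the tile shrinks.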

A5

Perform histogram reduction using CUDA APIs
Learn how to use CUDA atomic operations on GPU memory addresses
Analyze the performance impact of using atomics

A6

Implement a 1D inclusive parallel scan in CUDA using the work-efficient Brent-Kung algorithm
Learn the restrictions the algorithm places on the scan's binary operator, in particular with respect to commutativity
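
The Brent-Kung access pattern from A6 can be sketched on the CPU as follows. This is a sequential simulation in plain C++, not the lab's kernel: the name `brentKungInclusiveScan` is hypothetical, and the input length is assumed to be a power of two to keep the sketch short. Each inner loop iteration corresponds to work that one GPU thread would do within a step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place inclusive scan that follows the Brent-Kung two-phase pattern.
// Assumes data.size() is a power of two (a simplification for this sketch).
template <typename T, typename Op>
void brentKungInclusiveScan(std::vector<T>& data, Op op) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Reduction phase: combine pairs at growing strides, building a tree of
    // partial results whose root ends up in the last element.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] = op(data[i - stride], data[i]);
        }
    }

    // Distribution phase: push partial results back down at shrinking strides
    // to fill in the positions the reduction phase skipped.
    for (std::size_t stride = n / 4; stride > 0; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i + stride < n; i += 2 * stride) {
            data[i + stride] = op(data[i], data[i + stride]);
        }
    }
}
```

The sketch always calls `op(left, right)` with operands in their original order, so it relies on `op` being associative but not on it being commutative.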

A7

Implement sparse matrix-vector multiplication (SpMV) using CUDA and a matrix stored in transposed JDS (Jagged Diagonal Sparse) format
Implement the conversion from a 2D array to a JDS-formatted matrix using C++ and STL vectors
Compare the performance of the kernel with and without shared memory
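
The conversion and multiplication in A7 could look roughly like the sketch below. It is a plain C++ illustration under assumptions, not the lab's code: the struct `JdsMatrix`, its field names, and the functions `denseToJds` and `spmvJds` are all hypothetical, and the input is assumed to be a rectangular `std::vector<std::vector<float>>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical transposed-JDS layout: nonzeros are stored one jagged diagonal
// at a time, after sorting rows by nonzero count in descending order.
struct JdsMatrix {
    std::vector<float> data;     // nonzero values
    std::vector<int> colIndex;   // column of each stored value
    std::vector<int> diagStart;  // offset of each jagged diagonal, plus an end sentinel
    std::vector<int> rowPerm;    // rowPerm[r] = original index of sorted row r
    std::size_t numCols = 0;
};

JdsMatrix denseToJds(const std::vector<std::vector<float>>& dense) {
    const std::size_t rows = dense.size();
    JdsMatrix m;
    m.numCols = rows ? dense[0].size() : 0;

    // Collect the column indices of the nonzeros in each row.
    std::vector<std::vector<int>> cols(rows);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < dense[r].size(); ++c)
            if (dense[r][c] != 0.0f) cols[r].push_back(static_cast<int>(c));

    // Sort row indices so the longest rows come first.
    m.rowPerm.resize(rows);
    std::iota(m.rowPerm.begin(), m.rowPerm.end(), 0);
    std::stable_sort(m.rowPerm.begin(), m.rowPerm.end(),
                     [&](int a, int b) { return cols[a].size() > cols[b].size(); });

    // Emit the d-th nonzero of every row that has one, diagonal by diagonal.
    const std::size_t maxNnz = rows ? cols[m.rowPerm[0]].size() : 0;
    m.diagStart.push_back(0);
    for (std::size_t d = 0; d < maxNnz; ++d) {
        for (std::size_t r = 0; r < rows && cols[m.rowPerm[r]].size() > d; ++r) {
            const int orig = m.rowPerm[r];
            m.colIndex.push_back(cols[orig][d]);
            m.data.push_back(dense[orig][cols[orig][d]]);
        }
        m.diagStart.push_back(static_cast<int>(m.data.size()));
    }
    return m;
}

// y = M * x, with y returned in the original (unsorted) row order.
std::vector<float> spmvJds(const JdsMatrix& m, const std::vector<float>& x) {
    assert(x.size() == m.numCols);
    std::vector<float> y(m.rowPerm.size(), 0.0f);
    for (std::size_t d = 0; d + 1 < m.diagStart.size(); ++d) {
        const int begin = m.diagStart[d];
        const int end = m.diagStart[d + 1];
        for (int k = begin; k < end; ++k) {
            const int sortedRow = k - begin;  // position within this diagonal
            y[m.rowPerm[sortedRow]] += m.data[k] * x[m.colIndex[k]];
        }
    }
    return y;
}
```

Because consecutive sorted rows sit next to each other within a diagonal, GPU threads assigned to adjacent rows would read adjacent entries of `data`, which is the usual motivation for the transposed layout.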

Project

Compare the performance of the serial (CPU) and parallel versions of the Quick Hull algorithm for the convex hull problem
Introduce parallelized sorting, reduction, and computation to reduce the runtime of the algorithm
Check the submodule for more information
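
For orientation, a serial Quick Hull can be sketched in plain C++ as shown below. This is a generic textbook-style version written for illustration and is not the code in the submodule; the names `Point`, `cross`, `findHull`, and `quickHull` are hypothetical. Points that lie exactly on a hull edge are left out of the result.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point { double x, y; };

// Twice the signed area of triangle (a, b, p); positive when p is left of a->b.
double cross(const Point& a, const Point& b, const Point& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// pts holds only points strictly left of a->b. Appends the hull vertices
// between a and b (exclusive) to hull, in order from a to b.
void findHull(const std::vector<Point>& pts, const Point& a, const Point& b,
              std::vector<Point>& hull) {
    if (pts.empty()) return;
    // The point farthest from line a-b is guaranteed to be on the hull.
    const Point far = *std::max_element(
        pts.begin(), pts.end(), [&](const Point& p, const Point& q) {
            return cross(a, b, p) < cross(a, b, q);
        });
    // Keep only points outside triangle (a, far, b).
    std::vector<Point> leftOfAFar, leftOfFarB;
    for (const Point& p : pts) {
        if (cross(a, far, p) > 0) leftOfAFar.push_back(p);
        else if (cross(far, b, p) > 0) leftOfFarB.push_back(p);
    }
    findHull(leftOfAFar, a, far, hull);
    hull.push_back(far);
    findHull(leftOfFarB, far, b, hull);
}

// Returns hull vertices in clockwise order, starting at a leftmost point.
std::vector<Point> quickHull(const std::vector<Point>& pts) {
    assert(pts.size() >= 2);
    const auto mm = std::minmax_element(
        pts.begin(), pts.end(),
        [](const Point& p, const Point& q) { return p.x < q.x; });
    const Point a = *mm.first;   // leftmost point
    const Point b = *mm.second;  // rightmost point
    std::vector<Point> upper, lower;
    for (const Point& p : pts) {
        const double c = cross(a, b, p);
        if (c > 0) upper.push_back(p);
        else if (c < 0) lower.push_back(p);
    }
    std::vector<Point> hull;
    hull.push_back(a);
    findHull(upper, a, b, hull);
    hull.push_back(b);
    findHull(lower, b, a, hull);
    return hull;
}
```

The farthest-point search is a max reduction and the point partitioning is independent per point, which is where the parallel steps listed above would naturally apply.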