
COM4521 Parallel Computing with GPUs

Notes and source code from lectures, labs, and assignments for the course on Parallel Computing with Graphical Processing Units, led by Dr Paul Richmond at The University of Sheffield (Spring Semester 2020)

Summary of Lab and Assignment Work

  • Managing programming projects in C using Microsoft Visual Studio
  • Lab 1 - Introduction to Visual Studio and C Programming
    1. Create an array of random unsigned (short) integers using the (MSVC runtime) rand function from stdlib.h and calculate their sum, average, max, and min.
    2. Implement a linear congruential generator to increase the range of random integers that can be generated, then calculate the sum, average, max, and min of an array of such values (a sketch of the generator appears after this list).
    3. Write a random_float function by casting the output of the random_uint function from the previous exercise to float type. Calculate the sum, average, max, and min of an array of values generated by random_float to practice working with data types and floating-point precision.
    4. Create a rudimentary calculator which takes input from the command line.
    5. Modify the basic calculator so that it can read commands from a file.
  • Lab 2 - Memory and performance
    1. Read and print student records from a binary file, utilising pointers, structures, and dynamic memory allocation
    2. Read and print student records from a binary file with a different data format, using dynamic-length char arrays rather than statically defined (fixed-length) buffers to hold strings representing student names.
    3. Read, store, and display an arbitrary number of records using a (doubly) linked list data structure (a sketch appears after this list). Implementing a linked list gives extensive practice in allocating and freeing heap memory, as well as in correctly type-casting pointers and function pointers.
    4. Optimize a matrix multiplication program, and write code for performance profiling and benchmarking. Improvements include: 1. removing unnecessary memory accesses into large arrays at intermediate steps by accumulating in a local variable; 2. switching on compiler optimizations; and 3. pre-transposing the right-hand matrix before multiplication to avoid inefficient column-wise memory access (see the sketch after this list).
  • Lab 3 - OpenMP
    1. Parallelize a matrix multiplication program using OpenMP pragmas, defining the scope of each variable in the parallel block (private to each thread vs. shared between threads), and benchmark performance with omp_get_wtime rather than clock (see the sketch after this list).
    2. Generate images of the Mandelbrot set, saving in the PPM image file format. Parallelize the program using OpenMP pragmas and compare performance of:
      • Different methods for incrementing histogram frequency counters whilst avoiding race conditions (compared in a sketch after this list):
        • Thread-local counters; critical sections; atomic directives
      • Different parallel loop scheduling strategies: dynamic, static, and guided
  • Managing GPU programming projects written in CUDA using Microsoft Visual Studio
  • Lab 4 - Introduction to GPU programming with CUDA
    1. Implement an encryption/decryption program using an affine cipher. Decrypt a file of ciphertext.
      • Practices allocating and moving memory to and from a GPU device, configuring thread blocks, and writing and launching GPU kernel functions.
    2. Write CUDA code to implement vector addition on the GPU and validate it against a CPU implementation (a sketch appears after this list).
      • Dynamically allocate and move memory with cudaMalloc, cudaMemcpy, and cudaFree.
    3. Write CUDA code to implement matrix addition using a GPU, validate against a CPU implementation.
      • Develops understanding of CUDA 2D thread block layout
  • Lab 5 - CUDA Memory
    1. Implement vector addition using statically defined global device memory and cudaMemcpyToSymbol, time kernel execution with CUDA event timers, and calculate theoretical bandwidth to compare with measured bandwidth (a sketch appears after this list).
    2. Optimize a ray tracing algorithm to draw a bird's eye view of a scene containing coloured spheres.
      • Develops understanding of the uses, benefits, and limitations of the different types of device memory and their caches: global memory, read-only memory, and constant memory.
    3. Use 1D and 2D GPU texture memory to implement image blurring on a picture of a dog.
  • Assignment 1
    • We simulate and visualize a system of N bodies in frictionless 2D space moving under gravity (without collision mechanics), providing both a serial CPU implementation and a parallel implementation for multi-core processors using OpenMP.
      • See here for further background on the physics of the N-body problem.
      • We simulate the progression of the N-body system through time using the Forward Euler method for numerical integration (a sketch of one integration step appears after this list).
    • The program allows the user to set the number of bodies to simulate (with random initial positions), or to pass a CSV file of initial positions, velocities, and masses via command-line arguments/options.
    • Visualization is achieved using OpenGL libraries, or the user can choose to simulate for a fixed number of iterations without visualization and display results for benchmarking performance.
      • In visualization mode, alongside plotting and updating particle positions at each time step, we also calculate and display a 2D histogram/heatmap of particle densities within a background grid to identify clustering behaviour.
    • In both the serial CPU and OpenMP versions of the program, we compare and contrast various implementation techniques, providing benchmarking results and optimizing for efficiency.
      • In the OpenMP version, in particular, we compare parallelisation strategies (which loops provide the most speedup when parallelised), approaches to avoid race conditions when updating the heatmap, and different thread scheduling methods.
  • Assignment 2
    • Add an efficient GPU implementation of the N-body simulation and visualisation program, carefully managing device memory allocation and data transfer, and benchmarking performance. Different GPU memory caches are compared (where appropriate) to reduce the number of global device memory accesses and make good use of memory bandwidth, and race conditions are avoided when updating the heatmap (a sketch of the atomic heatmap update appears after this list).
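
Illustrative code sketches

The sketches below relate to the lab and assignment summaries above. They are simplified, self-contained examples rather than the actual lab solutions; any constants, names, or data layouts not mentioned in the summaries are illustrative assumptions noted in the comments.

Lab 1, exercise 2: a minimal linear congruential generator. The multiplier, increment, and seed are the widely used "Numerical Recipes" constants, chosen only for illustration; the lab's constants may differ.

```c
#include <stdio.h>

#define LCG_A 1664525u
#define LCG_C 1013904223u

static unsigned int lcg_state = 12345u;   /* illustrative seed */

/* Next pseudo-random unsigned int; the modulus is 2^32 via unsigned overflow. */
unsigned int random_uint(void)
{
    lcg_state = LCG_A * lcg_state + LCG_C;
    return lcg_state;
}

/* Exercise 3 builds on this by casting the output to float. */
float random_float(void)
{
    return (float)random_uint();
}

int main(void)
{
    unsigned long long sum = 0;
    unsigned int min = 0xFFFFFFFFu, max = 0;
    for (int i = 0; i < 1000; i++) {
        unsigned int r = random_uint();
        sum += r;
        if (r < min) min = r;
        if (r > max) max = r;
    }
    printf("min=%u max=%u avg=%.1f\n", min, max, sum / 1000.0);
    return 0;
}
```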
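Lab 2, exercise 3: a minimal doubly linked list of student records with heap-allocated, variable-length name strings. The field names and record layout are illustrative, not the lab's actual file format.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct student {
    char *name;              /* heap-allocated, variable length */
    float mark;
    struct student *prev;
    struct student *next;
} student;

/* Append a new record to the tail of the list; returns the new tail. */
student *append(student *tail, const char *name, float mark)
{
    student *s = (student *)malloc(sizeof(student));
    s->name = (char *)malloc(strlen(name) + 1);
    strcpy(s->name, name);
    s->mark = mark;
    s->prev = tail;
    s->next = NULL;
    if (tail) tail->next = s;
    return s;
}

/* Free every node, walking backwards from the tail. */
void free_list(student *tail)
{
    while (tail) {
        student *prev = tail->prev;
        free(tail->name);
        free(tail);
        tail = prev;
    }
}

int main(void)
{
    student *tail = NULL;
    tail = append(tail, "Ada", 92.0f);
    tail = append(tail, "Alan", 88.5f);
    for (student *s = tail; s; s = s->prev)
        printf("%s: %.1f\n", s->name, s->mark);
    free_list(tail);
    return 0;
}
```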
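Lab 2, exercise 4: the two memory-related optimisations described above, sketched for square matrices stored in row-major order. The matrix size and element type are illustrative.

```c
#define N 1024

void multiply_transposed(const double *A, const double *B, double *C, double *Bt)
{
    /* Pre-transpose B once so the inner loop reads it row-wise: Bt[j][k] = B[k][j]. */
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            Bt[j * N + k] = B[k * N + j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;                            /* local accumulator, not C[i][j] */
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * Bt[j * N + k];     /* both accesses are row-wise */
            C[i * N + j] = sum;                          /* single write per output element */
        }
    }
}
```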
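Lab 3, exercise 1: the OpenMP-parallelised multiply with explicit variable scoping, timed with omp_get_wtime. The matrix size is illustrative.

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

void multiply_omp(double *A, double *B, double *C)
{
    int i, j, k;
    double begin = omp_get_wtime();

    /* i (the parallel loop variable), j, k and sum are per-thread; the matrices are shared. */
    #pragma omp parallel for private(j, k) shared(A, B, C)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            for (k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    }

    printf("multiply took %f seconds\n", omp_get_wtime() - begin);
}
```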
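Lab 3, exercise 2: two of the ways a shared histogram can be incremented without a race condition. The escape_iterations function is a hypothetical stand-in for the Mandelbrot escape-time calculation, and the image dimensions are illustrative.

```c
#include <omp.h>

#define WIDTH  512
#define HEIGHT 512
#define MAX_ITER 100

/* Hypothetical per-pixel workload standing in for the Mandelbrot calculation. */
extern int escape_iterations(int x, int y);

void build_histogram(int histogram[MAX_ITER + 1])
{
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int iters = escape_iterations(x, y);

            /* Option 1: atomic increment (fine-grained, usually cheaper). */
            #pragma omp atomic
            histogram[iters]++;

            /* Option 2 (alternative): a critical section also works but
               serialises more than necessary:
               #pragma omp critical
               histogram[iters]++;
            */
        }
    }
}
```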
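Lab 4, exercise 2: vector addition on the GPU using cudaMalloc, cudaMemcpy, and cudaFree, with a simple grid/block configuration. The vector length and block size are illustrative.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 65536
#define THREADS_PER_BLOCK 256

__global__ void vector_add(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

int main(void)
{
    size_t size = N * sizeof(float);
    float *h_a = (float *)malloc(size), *h_b = (float *)malloc(size), *h_c = (float *)malloc(size);
    for (int i = 0; i < N; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    /* Allocate device memory and copy the inputs across. */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    vector_add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    /* Copy the result back and spot-check one element against the CPU answer. */
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[100] = %f (expected 300.0)\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```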
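Lab 5, exercise 1: vector addition using statically declared __device__ arrays populated with cudaMemcpyToSymbol, timed with CUDA events, followed by a measured-bandwidth estimate. The vector length and launch configuration are illustrative.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 65536

/* Statically defined global device memory. */
__device__ float d_a[N];
__device__ float d_b[N];
__device__ float d_c[N];

__global__ void vector_add_static(void)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        d_c[i] = d_a[i] + d_b[i];
}

int main(void)
{
    static float h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; i++) { h_a[i] = (float)i; h_b[i] = 1.0f; }

    cudaMemcpyToSymbol(d_a, h_a, sizeof(h_a));
    cudaMemcpyToSymbol(d_b, h_b, sizeof(h_b));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vector_add_static<<<N / 256, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* Measured bandwidth: bytes read plus written by the kernel, over the elapsed time. */
    double gb = 3.0 * N * sizeof(float) / 1e9;
    printf("%.3f ms, %.2f GB/s\n", ms, gb / (ms / 1000.0));

    cudaMemcpyFromSymbol(h_c, d_c, sizeof(h_c));
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```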
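Assignments 1 and 2: a single Forward Euler step of the 2D N-body update. The gravitational constant, softening term, time step handling, and struct layout are illustrative; the real program reads its parameters from the command line or an input file.

```c
#include <math.h>

#define G 6.674e-11
#define SOFTENING 1e-3f   /* illustrative; avoids division by zero for coincident bodies */

typedef struct { float x, y, vx, vy, m; } body;

void step(body *bodies, int n, float dt)
{
    /* Accumulate each body's acceleration from every other body,
       then advance its velocity (Forward Euler). */
    for (int i = 0; i < n; i++) {
        float ax = 0.0f, ay = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = bodies[j].x - bodies[i].x;
            float dy = bodies[j].y - bodies[i].y;
            float inv_r = 1.0f / sqrtf(dx * dx + dy * dy + SOFTENING);
            /* acceleration contribution: G * m_j * r / |r|^3 */
            float s = (float)G * bodies[j].m * inv_r * inv_r * inv_r;
            ax += s * dx;
            ay += s * dy;
        }
        bodies[i].vx += ax * dt;
        bodies[i].vy += ay * dt;
    }
    /* Advance positions in a second pass so all accelerations used the old positions. */
    for (int i = 0; i < n; i++) {
        bodies[i].x += bodies[i].vx * dt;
        bodies[i].y += bodies[i].vy * dt;
    }
}
```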
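Assignment 2: one way to update the density heatmap on the GPU without a race condition, using one thread per body and atomicAdd on the grid cell counters. The grid dimensions and coordinate normalisation are illustrative.

```cuda
#include <cuda_runtime.h>

#define GRID_DIM 16   /* heatmap is GRID_DIM x GRID_DIM over the unit square */

__global__ void update_heatmap(const float *x, const float *y,
                               unsigned int *counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    /* Map the body's position in [0,1) to a heatmap cell. */
    int cx = (int)(x[i] * GRID_DIM);
    int cy = (int)(y[i] * GRID_DIM);
    if (cx < 0 || cx >= GRID_DIM || cy < 0 || cy >= GRID_DIM) return;

    /* Many bodies may fall in the same cell, so the increment must be atomic. */
    atomicAdd(&counts[cy * GRID_DIM + cx], 1u);
}
```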

Useful references

C Programming

OpenMP

CUDA