The goal is to implement three matrix multiply functions, one for each of three hardware configurations: CPU (AVX/OpenMP), GPU (CUDA), and a cluster of two nodes (MPI). The matrices are real-only square matrices. Performance will be benchmarked relative to a reference implementation (Intel MKL for CPU and MPI, cuBLAS for GPU) for a 2048×2048 matrix, and marks will be assigned based on speed.
| Marks | CPU (AVX/OpenMP), 4 cores | GPU (CUDA), 1 GPU | MPI, 2 nodes, 4 cores each |
|---|---|---|---|
| 7 | 2.3 | 4.9 | 1.4 |
| 6 | 16.5 | 8 | 8.3 |
| 5 | 36.2 | 24.8 | 18.5 |
| 4 | 64.3 | 45.7 | 31.6 |
| 3 | 121 | 100 | 60.8 |
| 2 | 1000 (and gives the correct answer and the job doesn't time out) | | |
| 1 | Compiles and runs to completion, but gives the wrong answer | | |
| 0 | Doesn't compile, wasn't submitted, or timed out | | |
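
As a rough baseline for what the CPU function has to compute, the sketch below multiplies two square real matrices using an i-k-j loop order (which walks both `B` and `C` along rows) and an optional OpenMP pragma. The function name `matmulSketch`, the `float` element type, the `std::vector` arguments, and the row-major layout are all assumptions made for illustration; the prototype actually required by the provided headers may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical signature (not the assignment's real prototype).
// Assumes row-major storage: element (i, j) lives at index i * N + j.
// Computes C = A * B for N x N real matrices.
void matmulSketch(std::size_t N, const std::vector<float>& A,
                  const std::vector<float>& B, std::vector<float>& C) {
    assert(A.size() == N * N && B.size() == N * N);
    C.assign(N * N, 0.0f);

    // Rows of C are independent, so the outer loop can be shared across
    // threads. The pragma is only seen when compiling with OpenMP enabled.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long long i = 0; i < static_cast<long long>(N); ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * N;
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[row + k];
            // Innermost loop reads B and writes C contiguously, which is
            // friendlier to the cache and to compiler vectorization.
            for (std::size_t j = 0; j < N; ++j) {
                C[row + j] += a * B[k * N + j];
            }
        }
    }
}
```

Compiled without OpenMP support, the pragma is skipped and the loop runs serially. A loop like this is only a starting point; cache blocking and explicit AVX vectorization are the usual next steps toward the reference speed.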
- Final performance will be assessed on the vgpu10-0 and vgpu10-1 nodes of the rangpur.compute.eait.uq.edu.au cluster.
- For development, jobs can be submitted to getafix.smp.uq.edu.au.
- All nodes on rangpur have similar performance for CPU and GPU jobs and for most MPI implementations. Only a highly optimized MPI matrix multiply will show communication overhead.
- Random unitary, real-only square matrices are created and multiplied. The result is checked for correctness and for speed relative to the reference implementations.
- For CPU and GPU, matrix multiplication is on the same machine. For MPI, each node has its own copy of matrices from the start, and nodes need to maintain a copy of the current matrix product answer.
- Comment out the relevant `#define` lines in `Assignment1_GradeBot.cpp` and remove the relevant lines in the `MakeFile` if the corresponding hardware is not available.
- Only 3 files can be changed: `matrixMultiply.cpp`, `matrixMultiplyGPU.cu`, and `matrixMultiplyMPI.cpp`.
- Functions must not use outside libraries (except the provided headers) and must not write to stdout or to a file in the final submission.
- The script for the final grade is `goslurm_COSC3500Assignment_RangpurJudgementDay`.
- For debugging, use variations of the `goslurm_COSC3500Assignment_RangpurDebug` or `goslurm_COSC3500Assignment_GetafixDebug` scripts.
- `Assignment1_GradeBot.cpp` runs the benchmarks and assigns marks.
- Usage: `./Assignment1_GradeBot {matrix dimension} {threadCount} {runBenchmarkCPU} {runBenchmarkGPU} {runBenchmarkMPI} {optional integer} {optional integer}…`
- `Assignment1_GradeBot` outputs to stdout and to individual text files for each benchmark on each node (`COSC3500Assignment_{benchmark type}_{node}.txt`).
- The text files include 6 columns: Info., N, Matrices/second (MKL), Matrices/second (You), Error, Grade.
- Submission must include `matrixMultiply.cpp`, `matrixMultiplyGPU.cu`, `matrixMultiplyMPI.cpp`, and a zip file `slurm.zip` (containing the Slurm job output files), all zipped together in a file named `{your 8 digit student number}.zip`. If a required file is not implemented, submit the original blank file.
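
For the MPI case, where every node starts with both input matrices and must end up holding the full product, one common approach is to have each rank compute a contiguous block of rows and then exchange the blocks. The sketch below shows only the row partitioning and the local compute step, with no MPI calls, so it can be tried on a single machine. The helper names `rowRange` and `multiplyRows`, the `float` element type, and the row-major layout are hypothetical and are not taken from the assignment's headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: half-open row range [begin, end) owned by `rank`
// when N rows are split as evenly as possible across `size` ranks.
std::pair<std::size_t, std::size_t> rowRange(std::size_t N, std::size_t rank,
                                             std::size_t size) {
    assert(size > 0 && rank < size);
    const std::size_t base = N / size;
    const std::size_t rem = N % size;  // the first `rem` ranks get one extra row
    const std::size_t begin = rank * base + std::min(rank, rem);
    const std::size_t count = base + (rank < rem ? 1 : 0);
    return {begin, begin + count};
}

// Hypothetical helper: fills rows [begin, end) of C = A * B and leaves the
// other rows of C untouched. Assumes row-major N x N storage.
void multiplyRows(std::size_t N, std::size_t begin, std::size_t end,
                  const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C) {
    assert(A.size() == N * N && B.size() == N * N && C.size() == N * N);
    assert(begin <= end && end <= N);
    for (std::size_t i = begin; i < end; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            C[i * N + j] = 0.0f;
        }
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j) {
                C[i * N + j] += a * B[k * N + j];
            }
        }
    }
}
```

In an actual MPI version, each rank would call `multiplyRows` on its own range and the row blocks would then be combined so that both nodes hold the complete answer, for example with `MPI_Allgather` when the rows divide evenly between ranks. With 2048 rows and 2 ranks, `rowRange` gives rows 0 to 1023 to rank 0 and rows 1024 to 2047 to rank 1.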