2020/2021 - 4th Year, 2nd Semester
Course: Computação Paralela (CPAR) | Parallel Computing
Authors: David Silva (daviddias99), Luís Cunha (luispcunha)
Description: For the second project we tackled algorithm parallelization, applied to two algorithms: matrix multiplication and LU factorization. We used different frameworks to achieve parallelism: OpenMP (CPU), Sycl (CPU and GPU) and CUDA (baseline for GPU), and measured the performance of these algorithms for different matrix and block sizes.
Information on the proposed work can be found in docs/specification.pdf and on our results in docs/report.pdf.
Technologies: C/C++, OpenMP, Sycl, CUDA
Skills: Algorithm parallelization, algorithms, cache efficiency, matrix multiplication, blocking approach
Grade: 18.4/20
Previous work: Link to the first project
This project uses OpenMP, CUDA and Sycl. For the Sycl implementation we mainly used ComputeCpp; however, due to ComputeCpp's lack of support for CUDA devices, we used the DPC++ implementation to compile the Sycl code for an Nvidia GTX 1060.
You can run the compile_all.py script, or if you want to compile them individually:
To compile the OpenMP code simply use the following command:
g++ -O2 <file_name> -fopenmp
Both the matrix multiplication and the LU decomposition OMP programs receive the following arguments: matrix size, operation, number of runs and block size.
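For example (illustrative values; g++ writes the binary to a.out by default, and the operation codes are listed further below):
./a.out 1024 3 5 128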
To compile the CUDA version use the following command:
nvcc <file_name>
The CUDA program receives the following arguments: matrix size, operation, number of runs and block size.
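For example (illustrative values; nvcc also writes the binary to a.out by default):
./a.out 1024 1 5 16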
If the goal is to compile the code for the CPU or for a GPU that is supported by ComputeCpp, then we just need to execute the following command in a folder that contains a Makefile:
make <file_name_no_extension>
However, to compile the code for a CUDA GPU we must have the DPC++ implementation of Sycl and use the following command:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice <file_name>
The Sycl program receives the following arguments: matrix size, block size, operation, device (gpu, cpu, or manual to interactively choose the device) and number of runs.
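For example (illustrative values, assuming the Makefile names the binary after the source file):
./matmul 1024 16 3 gpu 5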
cuda/matmul.cu
1 - Matrix multiplication using blocks and local memory (static array creation), block size defined by the TILE_WIDTH macro (see the sketch after this list)
2 - Matrix multiplication using blocks and local memory (dynamic array creation)
3 - Matrix multiplication using blocks
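The kernel below is not the project's code, just a minimal sketch of the tiled approach behind mode 1: each thread block stages a TILE_WIDTH × TILE_WIDTH tile of A and B in statically allocated shared memory before accumulating, and the matrix size is assumed to be a multiple of TILE_WIDTH.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_WIDTH 16

// C = A * B for square n x n matrices, n assumed to be a multiple of TILE_WIDTH
__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
  __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];  // static shared-memory tiles
  __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

  int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
  int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
  float sum = 0.0f;

  for (int t = 0; t < n / TILE_WIDTH; ++t) {
    // each thread loads one element of each tile into shared memory
    tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_WIDTH + threadIdx.x];
    tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * n + col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads();
  }
  C[row * n + col] = sum;
}

int main() {
  const int n = 1024;
  size_t bytes = n * n * sizeof(float);
  float *A, *B, *C;
  cudaMallocManaged(&A, bytes);
  cudaMallocManaged(&B, bytes);
  cudaMallocManaged(&C, bytes);
  for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

  dim3 block(TILE_WIDTH, TILE_WIDTH);
  dim3 grid(n / TILE_WIDTH, n / TILE_WIDTH);
  matmulTiled<<<grid, block>>>(A, B, C, n);
  cudaDeviceSynchronize();

  printf("C[0] = %f\n", C[0]);  // expect 2 * n
  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}
```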
omp/lu/lu.cpp
1 - LU naive
2 - LU blocks
3 - LU OMP Data Parallel (see the sketch after this list)
4 - LU OMP Functional Parallel
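Not taken from the project; a minimal sketch of the in-place factorization (Doolittle, no pivoting) with the data-parallel strategy of mode 3: at each step k the updates of rows k+1..n-1 are independent, so that loop is split among threads.

```cpp
#include <cstdio>
#include <omp.h>
#include <vector>

// in-place LU factorization (no pivoting): L is stored below the diagonal
// (unit diagonal implied), U on and above it
void lu(std::vector<double> &a, int n) {
  for (int k = 0; k < n; ++k) {
    // the updates of rows k+1..n-1 only depend on row k, so they can run in parallel
    #pragma omp parallel for
    for (int i = k + 1; i < n; ++i) {
      a[i * n + k] /= a[k * n + k];
      for (int j = k + 1; j < n; ++j)
        a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
  }
}

int main() {
  const int n = 3;
  std::vector<double> a = {4, 3, 2,
                           8, 8, 5,
                           4, 7, 9};
  lu(a, n);
  for (int i = 0; i < n; ++i)
    printf("%6.2f %6.2f %6.2f\n", a[i * n], a[i * n + 1], a[i * n + 2]);
  return 0;
}
```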
omp/matmul/matmul.cpp
1 - MM naive sequential
2 - MM OMP without collapse directive
3 - MM OMP with collapse directive
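Again not the project's code; a minimal sketch of the difference between modes 2 and 3: without collapse only the i loop is divided among threads, while collapse(2) fuses the i and j loops into a single n² iteration space before scheduling.

```cpp
#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
  const int n = 1024;
  std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

  // collapse(2) fuses the i and j loops into one n*n iteration space,
  // giving the scheduler more parallel work than splitting i alone
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      double sum = 0.0;
      for (int k = 0; k < n; ++k)
        sum += a[i * n + k] * b[k * n + j];
      c[i * n + j] = sum;
    }

  printf("c[0] = %f\n", c[0]);  // expect 2 * n
  return 0;
}
```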
sycl/lu/lu.cpp
1 - LU factorization
sycl/matmul/matmul.cpp
1 - MM naive (see the sketch after this list)
2 - MM Blocks without local memory
3 - MM Blocks with local memory
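Not the project's code; a minimal sketch of the naive version (mode 1, referenced above) in SYCL 2020 / DPC++ style, with one work-item per element of C and buffers handling the host/device transfers. ComputeCpp (SYCL 1.2.1) uses the older CL/sycl.hpp header and accessor syntax, so details may differ.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
  constexpr size_t n = 512;
  std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);

  sycl::queue q;  // default device selection
  {
    sycl::buffer<float, 2> bufA(a.data(), sycl::range<2>(n, n));
    sycl::buffer<float, 2> bufB(b.data(), sycl::range<2>(n, n));
    sycl::buffer<float, 2> bufC(c.data(), sycl::range<2>(n, n));

    q.submit([&](sycl::handler &h) {
      sycl::accessor A(bufA, h, sycl::read_only);
      sycl::accessor B(bufB, h, sycl::read_only);
      sycl::accessor C(bufC, h, sycl::write_only);
      // one work-item computes one element of C
      h.parallel_for(sycl::range<2>(n, n), [=](sycl::item<2> it) {
        size_t i = it[0], j = it[1];
        float sum = 0.0f;
        for (size_t k = 0; k < n; ++k)
          sum += A[i][k] * B[k][j];
        C[i][j] = sum;
      });
    });
  }  // buffer destructors copy the result back into c

  printf("c[0] = %f\n", c[0]);  // expect 2 * n
  return 0;
}
```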