Parallel CPU and GPU high-performance computing course
This short course provides an interactive, applied, hands-on introduction to parallel and high-performance computing in Julia. It covers trending areas in modern geocomputing: solving differential equations requires efficient numerical schemes that optimally leverage modern hardware. Such solvers permit resolving scientific problems that were technically out of reach a decade ago.
The goal of this short course is to offer an interactive, tutorial-like hands-on to solving systems of differential equations in parallel on many-core hardware accelerators such as GPUs, using the Julia language. Julia combines high-level language simplicity with low-level language performance. The resulting codes and applications are fast, short and readable [1, 2, 3].
The iterative algorithms can be converted into efficient linear and non-linear solvers relying on a second-order Richardson-type iteration strategy [4].
- Objectives
- Pre-requisite
- Material
- Getting started
- Course outline
- Advanced start
- Extras
- Further reading
We will design and implement an iterative numerical algorithm that resolves (non-linear) diffusion in 2D for two applications:
- The diffusion of heat:
dT/dt = 1/ρCp*(-dqx/dx -dqy/dy)
qx = -λ*dT/dx
qy = -λ*dT/dy
For an initial Gaussian distribution, the heat diffusion code produces the following output:
- The non-linear diffusion of ice topography (simplified shallow-ice):
dH/dt = -dqx/dx -dqy/dy + b
qx = -H^n*dH/dx
qy = -H^n*dH/dy
For an initial Gaussian distribution of ice and a circular, centred source/sink term, the simplified shallow-ice code produces the following output while reaching a steady state:
These two examples will enable us to address the technical objectives of this course (3 parts). A minimal code sketch of the heat-diffusion example (1) is shown below.
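To make example (1) concrete, here is a minimal vectorised sketch of the explicit heat-diffusion update in plain Julia (grid size, parameters and number of time steps are illustrative and do not reproduce the course scripts exactly):

```julia
using Plots

# Vectorised explicit 2D heat diffusion (illustrative parameters).
nx, ny = 128, 128
lx, ly = 10.0, 10.0
dx, dy = lx/nx, ly/ny
λ, ρCp = 1.0, 1.0
dt     = min(dx, dy)^2/(λ/ρCp)/4.1                 # explicit stability limit
xc, yc = LinRange(dx/2, lx-dx/2, nx), LinRange(dy/2, ly-dy/2, ny)
T      = [exp(-(x - lx/2)^2 - (y - ly/2)^2) for x in xc, y in yc]   # initial Gaussian

for it = 1:500
    qx   = -λ .* diff(T, dims=1) ./ dx             # heat flux in x
    qy   = -λ .* diff(T, dims=2) ./ dy             # heat flux in y
    dTdt = (.-diff(qx[:, 2:end-1], dims=1)./dx .- diff(qy[2:end-1, :], dims=2)./dy) ./ ρCp
    T[2:end-1, 2:end-1] .= T[2:end-1, 2:end-1] .+ dt .* dTdt
end
heatmap(xc, yc, T', title="T after diffusion")
```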
Part 1 | We will use (1) as playground to address:
- vectorised plain Julia CPU implementation (same workflow as in Python, MATLAB or Octave)
- vectorised versus loop versions on the CPU
- "kernel"-style loops and multi-threading on multi-core CPUs (see the sketch after this list)
- vectorised plain Julia with a GPU backend (similar to abstracted GPU functions in e.g. Python or MATLAB)
- explicit "kernel"-style GPU programming (Julia's power: low-level CUDA handles, as in C)
- enabling single/multi-XPU (both multi-core CPU and GPU) using ParallelStencil.jl
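For instance, the "kernel"-style loop version of the temperature update with Base.Threads multi-threading could be sketched as follows (function and variable names are illustrative, not the exact course kernels; the flux and temperature updates are fused into one loop):

```julia
using Base.Threads

# "Kernel"-style compute function: explicit loops over the inner grid points,
# parallelised over the CPU threads set via JULIA_NUM_THREADS.
function compute_T!(T2, T, λ_ρCp, dt, dx, dy)
    nx, ny = size(T)
    @threads for iy = 2:ny-1
        for ix = 2:nx-1
            T2[ix, iy] = T[ix, iy] + dt*λ_ρCp*(
                (T[ix+1, iy] - 2T[ix, iy] + T[ix-1, iy])/dx^2 +
                (T[ix, iy+1] - 2T[ix, iy] + T[ix, iy-1])/dy^2)
        end
    end
    return
end

T  = rand(128, 128); T2 = copy(T)
compute_T!(T2, T, 1.0, 1e-4, 0.1, 0.1)
T, T2 = T2, T   # swap the arrays between iterations instead of copying
```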
Part 2 | We will use (2) as playground to address:
- explicit time stepping
- transient and steady-state solutions
- explicit vs implicit solutions (see the pseudo-transient sketch after this list)
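As a flavour of what "implicit" means here, the steady state can be reached by pseudo-transient (damped, second-order Richardson-type) iterations on the residual. A minimal 1D sketch with illustrative parameters and damping (not the course's sia_2D scripts) reads:

```julia
# Pseudo-transient iteration to steady state of d2H/dx2 + b = 0 (1D sketch).
nx   = 128
lx   = 10.0
dx   = lx/nx
xc   = LinRange(dx/2, lx-dx/2, nx)
H    = exp.(-(xc .- lx/2).^2)          # initial guess
b    = 0.1 .* sin.(2π .* xc ./ lx)     # source/sink term
dτ   = dx^2/2.1                        # pseudo-time step
damp = 1.0 - 2π/nx                     # damping factor (iteration acceleration)
dHdτ = zeros(nx-2)
for iter = 1:10_000
    ResH  = diff(diff(H))./dx^2 .+ b[2:end-1]   # residual of the steady-state equation
    dHdτ .= damp .* dHdτ .+ ResH                # damped rate of change
    H[2:end-1] .+= dτ .* dHdτ                   # update the interior points
    maximum(abs.(ResH)) < 1e-8 && break
end
```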
Part 3 | We will use (1) and (2) to address:
- distributed memory parallelisation ("fake" and "real" parallelisation; a sketch of the "fake" approach follows this list)
- local and global domain, internal and global boundaries, initial condition, boundary update, synchronisation
- visualisation and I/O
- message passing interface (MPI), MPI + GPU (CUDA-aware MPI)
- communication/computation overlap (hide communication)
- using ImplicitGlobalGrid.jl for high-level implementation along with ParallelStencil.jl
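The "fake parallelisation" idea from the first bullet can be sketched in a few lines: the 1D temperature array is split into a left (TL) and a right (TR) domain, each holding one ghost cell at the internal boundary, and the ghost cells are refreshed from the neighbour after every step (sizes and parameters below are illustrative):

```julia
# "Fake parallelisation" on two sub-domains, no MPI involved (illustrative sketch).
nx = 32                               # local size, including one ghost cell per side
dx = 0.1; dt = dx^2/2.1               # explicit stability limit (λ = 1)
TL = zeros(nx); TL[nx ÷ 2] = 1.0      # heat pulse in the left domain
TR = zeros(nx)
for it = 1:200
    TL[2:end-1] .+= dt .* diff(diff(TL)) ./ dx^2   # diffusion on the left domain
    TR[2:end-1] .+= dt .* diff(diff(TR)) ./ dx^2   # diffusion on the right domain
    TL[end] = TR[2]                   # boundary update: each ghost cell receives
    TR[1]   = TL[end-1]               # the neighbour's adjacent inner value
end
```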
The hands-on format prioritises learning-by-doing, thus not much preliminary knowledge is required. Basic programming skills won't hurt, though. The course builds upon the use of the Julia programming language.
The majority of stencil-based codes, such as those in this course, are memory-bound: the limiting factor for performance is the rate at which data is transferred between memory and the arithmetic units. The maximal rate at which these transfers can occur is the memory copy rate, on the order of 50 GB/s for CPUs and about 1 TB/s for modern GPUs. The effective memory throughput metric (T_eff) measures how well an iterative stencil-based algorithm performs in terms of memory throughput, to be compared to the memory copy rate. The T_eff formula reads: T_eff = (nIO/1e9*nxy*PRECIS)/(time_s/nt), where nIO is the number of read/write operations (2 for an update rule), nxy is the numerical grid resolution, PRECIS is the arithmetic precision (8 or 4 bytes per number), and time_s is the execution time in seconds to perform nt iterations [1].
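For example, T_eff of a trivial array update can be measured as follows (a sketch with illustrative sizes; the course codes apply the same formula to their actual stencil updates):

```julia
# Measure T_eff for the update rule B .= A .+ 1 (1 read + 1 write per grid point).
nx = ny = 4096
nt = 100
A  = rand(Float64, nx, ny); B = copy(A)
t0 = Base.time()
for it = 1:nt
    B .= A .+ 1.0
end
time_s = Base.time() - t0
nIO, PRECIS = 2, sizeof(eltype(A))              # 2 array accesses, 8 bytes per number
T_eff = nIO/1e9*nx*ny*PRECIS/(time_s/nt)        # effective memory throughput in GB/s
println("T_eff ≈ $(round(T_eff, digits=2)) GB/s")
```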
On the CPU, multi-threading is made accessible via Base.Threads, and the environment variable JULIA_NUM_THREADS defines the number of cores to use, e.g. export JULIA_NUM_THREADS=2 to enable 2 threads (2 CPU cores). The CUDA.jl module permits launching compute kernels on Nvidia GPUs from within Julia. JuliaGPU provides further reading and introductory material about the GPU ecosystem within Julia.
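For illustration, a minimal CUDA.jl kernel launch could look as follows (a sketch assuming an Nvidia GPU; the kernel, array sizes and launch configuration are illustrative — here a plain memory copy, the operation underlying the memory copy rate mentioned above):

```julia
using CUDA

# Each GPU thread copies one array element; a 2D launch configuration covers the grid.
function memcopy!(B, A)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if ix <= size(A, 1) && iy <= size(A, 2)
        B[ix, iy] = A[ix, iy]
    end
    return
end

nx, ny  = 1024, 1024
A       = CUDA.rand(Float64, nx, ny)
B       = CUDA.zeros(Float64, nx, ny)
threads = (16, 16)
blocks  = (cld(nx, threads[1]), cld(ny, threads[2]))
@cuda blocks=blocks threads=threads memcopy!(B, A)
synchronize()
```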
The course material contains some ready-to-run example scripts, draft tmp scripts to be completed as tasks during the course, and their corresponding solution scripts.
- example scripts | The active working directory for the course will be scripts, that contains the example scripts and the tmp scripts to work on.
- solution scripts | All tmp scripts have their corresponding solution scripts located in solutions.
If it applies, follow the instructions provided on the course's private channel.
In general, clone this repo (or download it otherwise) to run the example scripts and access the draft scripts to be completed during the course. Solutions or "cheat-sheets" can be found in the solutions folder. The examples rely on 3 main Julia modules, Plots.jl (and PyPlot.jl) and CUDA.jl. The XPU examples require ParallelStencil.jl to be installed. The MPI examples require MPI.jl to be installed. The multi-XPU scripts require ImplicitGlobalGrid.jl to be installed.
There are two ways of executing a Julia script, from the Julia command window known as the Julia REPL, or from the terminal shell directly. The MPI and multi-XPU examples need to be executed from the terminal shell.
To run Julia interactively, start Julia from the shell (or Terminal). Go to the geo-hpc-course folder, then start Julia appending the --project flag to gain access to the required modules:
$ cd <path-to>/geo-hpc-course/
$ julia --project
Then, in the Julia REPL, you can execute the script as follows:
julia> include("<my_script>.jl")
💡 Note that typing ; in the Julia REPL permits you to execute shell commands (like cd ..).
For optimal performance (e.g. when measuring T_eff) and for running Julia MPI, run Julia as an executable from the shell directly, using the optimisation flag -O3 and disabling bound checking with --check-bounds=no, as follows:
$ julia --project -O3 --check-bounds=no <my_script>.jl
Note that interactive plotting may fail then.
Set the default viz = false flag to true if you want to plot output in all codes beyond step 2.
This section is about launching a Julia MPI script. For MPI.jl install notes, refer to the Advanced start - Julia MPI section and the MPI.jl doc. In the proposed approach, each MPI process will handle one CPU thread. In the MPI GPU case (multi-GPUs), each MPI process handles one GPU.
Assuming a working Julia MPI installation, a Julia MPI program can be launched using the Julia MPI wrapper mpiexecjl (located in ~/.julia/bin).
Running the Julia MPI hello_mpi.jl script on 4 processes can be achieved as follows:
$ mpiexecjl -n 4 julia --project scripts/hello_mpi.jl
$ Hello world, I am 0 of 3
$ Hello world, I am 1 of 3
$ Hello world, I am 2 of 3
$ Hello world, I am 3 of 3
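For reference, a minimal MPI "hello world" along these lines could read (a sketch; the actual hello_mpi.jl script may differ slightly):

```julia
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)     # ID of this process, starting at 0
nproc = MPI.Comm_size(comm)     # total number of processes
println("Hello world, I am $rank of $(nproc - 1)")   # "of 3" on 4 processes, as above
MPI.Finalize()
```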
The 2D Julia MPI diffusion script heat_2D_mpi.jl executed on 4 MPI processes (global grid: 2x2) produces the following output (see Extras for info about the gif-making scripts).
Note: The presented concise Julia MPI scripts are inspired by this 2D Python script.
Advanced documentation on running the multi-XPU codes can be found in the ParallelStencil.jl module documentation.
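To give an idea of what such a multi-XPU code involves, a heavily abridged skeleton combining ParallelStencil.jl and ImplicitGlobalGrid.jl might look as follows (kernel, array sizes and parameters are simplified and illustrative compared to heat_2D_multixpu.jl):

```julia
const USE_GPU = false
using ParallelStencil, ParallelStencil.FiniteDifferences2D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 2)
else
    @init_parallel_stencil(Threads, Float64, 2)
end
using ImplicitGlobalGrid

@parallel function step!(T2, T, λ_ρCp, dt, dx, dy)
    @inn(T2) = @inn(T) + dt*λ_ρCp*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2)
    return
end

function diffusion_2D()
    nx, ny   = 64, 64                          # local (per-process) grid size
    me, dims = init_global_grid(nx, ny, 1)     # set up the Cartesian MPI topology
    dx = dy  = 0.1
    dt       = min(dx, dy)^2/4.1
    T, T2    = @rand(nx, ny), @zeros(nx, ny)
    for it = 1:100
        @parallel step!(T2, T, 1.0, dt, dx, dy)
        update_halo!(T2)                       # exchange the internal boundaries
        T, T2 = T2, T
    end
    finalize_global_grid()
end
diffusion_2D()
```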
During the course, we will go through the following steps:
1. Intro part 1
2. TODO Finalise the 1D heat diffusion code heat_1D_tmp.jl, implementing the equations from Objectives (1).
3. See how the diffusion looks in 2D: heat_2D.jl.
4. TODO Finalise the 2D loop version of the heat code heat_2D_loop_tmp.jl.
5. TODO Import the flux and T loop calculations in the heat code using external "kernel"-like compute functions: heat_2D_loop_fun_tmp.jl.
6. See how the function-based loop version (5.) can be further extended to check bounds with if statements in the ix and iy loops, including "loop-fusion" for the flux computations: heat_2D_loop_fun_gpustyle.jl.
7. See how one can simply use the GPU to perform the 2D heat diffusion calculations: heat_2D_gpu.jl.
8. TODO The performance "magic"; update the script heat_2D_gpu_fun_tmp.jl based on previous knowledge and step (5.).
9. See how steps 5., 6. and 7. can be combined into a single code using ParallelStencil.jl in heat_2D_xpu.jl.
10. Discussion on CPU vs GPU architectures and performance evaluation (T_eff). Q&A.
11. Intro part 2
12. TODO Based on your acquired experience, finalise the sia_2D_tmp.jl script to convert the heat diffusion T into an ice cap thickness H evolution over time.
13. TODO Modify the script from (12.) to have an implicit solver while reaching a steady-state solution.
14. TODO (NEW!) You demystified GPU computing by completing step 9. Now update the script sia_2D_xpu_tmp.jl to have an XPU (CPU or GPU) code ready!
15. Discussion about pseudo-transient solvers, damping and convergence. Q&A.
16. Intro part 3 (NEW!)
17. TODO Finalise the 1D heat diffusion code heat_1D_2procs_tmp.jl, splitting the calculation of the temperature evolution over one left and one right domain. This "fake parallelisation" requires left and right temperature arrays, TL and TR, respectively.
18. TODO Generalise the "fake-parallel" approach from 2 processes to n processes by modifying the code heat_1D_nprocs_tmp.jl, taking care of implementing the initial condition, the heat diffusion physics and the boundary update.
19. The script hello_mpi.jl shows a "Hello World" example implementing MPI.jl. Use this script to test your MPI.jl install (see the Running Julia MPI section for more info on installing and running Julia MPI).
20. Discover a concise MPI 1D heat diffusion example, heat_1D_mpi.jl. Learn about the minimal requirements to initialise a Cartesian MPI topology and how to code the boundary update functions (here using blocking messages). Use the vizme1D_mpi.jl script to visualise the results (each MPI process saving its local output).
21. TODO Yay, you have your MPI 1D Julia script running! Finalise the MPI 2D heat diffusion script heat_2D_mpi_tmp.jl to solve the 2D diffusion equation using MPI. Use the vizme2D_mpi.jl script to visualise the results (each MPI process saving its local output).
22. Now that you have demystified distributed memory parallelisation, see how using ImplicitGlobalGrid.jl along with ParallelStencil.jl leads to concise and efficient distributed memory parallelisation on multiple XPUs in 2D: heat_2D_multixpu.jl. Also, take a closer look at the @hide_communication feature. Further info can be found here.
23. TODO Instrument the 2D shallow ice code sia_2D_xpu.jl (task 14.) to enable distributed memory parallelisation using ImplicitGlobalGrid.jl along with ParallelStencil.jl.
    💡 Use sia_2D_xpu.jl for a quick start, and sia_2D_multixpu.jl for a solution.
24. Yay, you made it - you demystified running Julia codes in parallel on multi-XPU :-) Q&A.
Steps already done on the GPU server you are running on (CentOS 8 Linux).
Starting in the shell:
$ wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz
$ tar -xzf julia-1.6.2-linux-x86_64.tar.gz
$ vim ~/.bashrc # Add the line: PATH=~/julia-1.6.2/bin/:$PATH
$ cd <path-to>/geo-hpc-course/
$ julia --project
Once Julia is launched, enter the package mode ] and instantiate the project. This will download all the project's dependencies and install them:
julia> ]
(geo-hpc-course) pkg> instantiate
julia>
If your GPU system contains more than one GPU, you can add the following at the beginning of each gpu-named code to target a specific device, identified by its unique ID (default being 0):
GPU_ID = ID
CUDA.device!(GPU_ID)
The following steps permit you to install MPI.jl on your machine:
- Add MPI.jl:
(project) pkg> add MPI
julia> using MPI
[ Info: Precompiling MPI [da04e1cc-30fd-572f-bb4f-1f8673147195]
julia> MPI.install_mpiexecjl()
[ Info: Installing `mpiexecjl` to `HOME/.julia/bin`...
[ Info: Done!
- Then, one should add HOME/.julia/bin to PATH in order to launch the Julia MPI wrapper mpiexecjl.
- Running the Julia MPI code on 4 processes:
$ HOME/.julia/bin/mpiexecjl -n 4 julia --project scripts/hello_mpi.jl
💡 Note: On macOS, there seems to be an issue (JuliaParallel/MPI.jl#407). To fix it, define the following ENV variable:
$ export MPICH_INTERFACE_HOSTNAME=localhost
and add -host localhost to the execution script:
$ HOME/.julia/bin/mpiexecjl -n 4 -host localhost julia --project scripts/hello_mpi.jl
Julia supports UTF-8 (Unicode) characters. Also, the plotting package Plots.jl permits creating gif animations out of the box. The /extras/heat_2D_gif_unicode.jl script exemplifies these two fun capabilities.
The code /extras/sia_2D_ss_gif.jl uses the out-of-the-box gif-making capabilities to produce the SIA non-linear diffusion gif.
The code /extras/heat_2D_mpi_gif.jl produces time-dependent output to be visualised as a gif using the /extras/vizme2D_mpi_gif.jl script.
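As a taste of the built-in gif support, a toy example (not one of the course scripts) could be:

```julia
using Plots

# Animate an evolving random field and write it to a gif with Plots.jl's @gif macro.
A = rand(32, 32)
@gif for it = 1:30
    A .= 0.9 .* A .+ 0.1 .* rand(32, 32)
    heatmap(A, clims=(0, 1), title="step $it")
end
```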
💡 Note: On Linux machines, emoji fonts may need to be installed in order to display the Unicode emoticons.
$ sudo dnf install google-noto-emoji-color-fonts.noarch