
High Performance Computing

University projects using OpenMP, MPI (C & Python), CUDA

General Requirements

  • gcc & g++ compilers
  • MPI, CUDA libraries

OpenMP

DDA

Two threads run in parallel: one starts from the line's start point (red line) and the other from its end point (green line), each drawing toward the midpoint.
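A minimal sketch of the two-thread idea (not the repository's exact code): the line is rasterized with DDA from both ends toward the midpoint, and the pixel coordinates are buffered and printed here instead of being drawn with the BGI putpixel() call.

/* Hypothetical sketch: two OpenMP threads rasterize one line with DDA,
   one from each end toward the midpoint. Pixels go into a buffer instead
   of a putpixel() call. Compile with: gcc -fopenmp dda_sketch.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define MAX_PIX 4096

int main(void) {
    int xs = 0, ys = 0, xe = 100, ye = 40;        /* example endpoints */
    int adx = abs(xe - xs), ady = abs(ye - ys);
    int steps = adx > ady ? adx : ady;
    float dx = (float)(xe - xs) / steps;
    float dy = (float)(ye - ys) / steps;
    float px[MAX_PIX], py[MAX_PIX];

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {          /* "red" half: start -> midpoint */
            float x = xs, y = ys;
            for (int i = 0; i <= steps / 2; i++, x += dx, y += dy) {
                px[i] = x; py[i] = y;             /* putpixel(round(x), round(y), RED); */
            }
        } else {                                  /* "green" half: end -> midpoint */
            float x = xe, y = ye;
            for (int i = steps; i > steps / 2; i--, x -= dx, y -= dy) {
                px[i] = x; py[i] = y;             /* putpixel(round(x), round(y), GREEN); */
            }
        }
    }

    for (int i = 0; i <= steps; i++)
        printf("(%d, %d)\n", (int)roundf(px[i]), (int)roundf(py[i]));
    return 0;
}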

Extra Instructions

To compile and see the output in your cmd (Windows) or terminal (Linux), you need the Borland Graphics Interface (on Windows) or the SDL graphics library (on Linux). See the installation instructions for Windows & Linux.

Usage:

dda_seq.c
g++ dda_seq.c -lbgi -lgdi32 -lcomdlg32 -luuid -loleaut32 -lole32 -o dda
Windows
dda
Linux
./dda
dda_parallel_midpoint.c
g++ dda_parallel_midpoint.c -lbgi -lgdi32 -lcomdlg32 -luuid -loleaut32 -lole32 -fopenmp -o dda_p
Windows
dda_p
Linux
./dda_p
Screenshot

[screenshot]

Chaos

Used omp parallel for in the triangle-drawing loop, so the points are drawn in parallel. See more about the Chaos Game.
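The repository parallelizes the point-drawing loop; the sketch below shows one common way to structure a parallel chaos game, under the assumption that each thread may generate its own independent point sequence (rand_r is POSIX, so a different per-thread RNG would be needed on Windows).

/* Hypothetical sketch: each OpenMP thread generates its own chaos-game
   point sequence for the Sierpinski triangle; the final point is printed
   instead of calling putpixel(). Compile with: gcc -fopenmp chaos_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define POINTS_PER_THREAD 10000

int main(void) {
    double vx[3] = {0.0, 300.0, 600.0};           /* triangle vertices */
    double vy[3] = {600.0, 0.0, 600.0};

    #pragma omp parallel
    {
        unsigned int seed = 1234u + omp_get_thread_num();   /* per-thread RNG state */
        double x = 300.0, y = 300.0;                         /* arbitrary start point */
        for (int i = 0; i < POINTS_PER_THREAD; i++) {
            int v = rand_r(&seed) % 3;            /* pick a random vertex */
            x = (x + vx[v]) / 2.0;                /* jump halfway toward it */
            y = (y + vy[v]) / 2.0;
            /* putpixel((int)x, (int)y, WHITE); in the graphical version */
        }
        printf("thread %d finished at (%.1f, %.1f)\n", omp_get_thread_num(), x, y);
    }
    return 0;
}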

Extra Instructions

As with DDA, you need the Borland Graphics Interface (Windows) or the SDL graphics library (Linux) to compile and see the output. See the installation instructions for Windows & Linux.

Usage:

chaos_seq.c
g++ chaos_seq.c -lbgi -lgdi32 -lcomdlg32 -luuid -loleaut32 -lole32 -fopenmp -o chaos
Windows
chaos
Linux
./chaos
chaos_parallel.c
g++ chaos_parallel.c -lbgi -lgdi32 -lcomdlg32 -luuid -loleaut32 -lole32 -fopenmp -o chaos_p
Windows
chaos_p
Linux
./chaos_p
Screenshot

[screenshot]

char_freq

Counts the frequency of each ASCII character in an input txt file. Parallelized in seven different ways (a sketch of the local atomic variant follows the list):

  • Using Global Array of Locks
  • Using Local Array of Locks
  • Using Atomic Operation (Global)
  • Using Atomic Operation (Local)
  • Using Critical Operation (Global)
  • Using Critical Operation (Local)
  • Using Reduction
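A sketch of the local atomic variant (illustrative, not the repository's exact code): each thread counts characters in its share of the buffer privately, then merges its local array into the shared frequency array with atomic adds.

/* Hypothetical sketch of the "Atomic Operation (Local)" approach.
   The input buffer is an in-memory string here; the real program reads a txt file.
   Compile with: gcc -fopenmp char_freq_sketch.c */
#include <stdio.h>
#include <string.h>

#define N 128   /* size of the ASCII alphabet */

int main(void) {
    const char *text = "replace with the contents of the input txt file";
    int len = (int)strlen(text);
    int freq[N] = {0};

    #pragma omp parallel
    {
        int local[N] = {0};

        #pragma omp for
        for (int i = 0; i < len; i++)
            local[(unsigned char)text[i] % N]++;  /* private counting, no synchronization */

        for (int c = 0; c < N; c++) {
            #pragma omp atomic
            freq[c] += local[c];                  /* one synchronized merge per bucket */
        }
    }

    printf("'e' appears %d times\n", freq['e']);
    return 0;
}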

Usage

gcc -fopenmp char_freq_parallel_<method>.c -o char_freq_p
Windows
char_freq_p
Linux
./char_freq_p

Execution Time

Using the bible.txt file as input.

[execution time chart]

count_sort

Parallelized count sort algorithm (enumeration sort), using omp parallel for. Change the N macro inside the code.
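A minimal sketch of the enumeration-sort idea with omp parallel for (the array size and contents are illustrative): for each element, count how many elements must precede it, then write it directly to that position of the output array.

/* Hypothetical sketch: parallel enumeration (count) sort.
   Compile with: gcc -fopenmp count_sort_sketch.c */
#include <stdio.h>
#include <stdlib.h>

#define N 16    /* the repository uses a much larger N macro */

int main(void) {
    int a[N], sorted[N];
    for (int i = 0; i < N; i++) a[i] = rand() % 100;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        int pos = 0;
        for (int j = 0; j < N; j++)
            if (a[j] < a[i] || (a[j] == a[i] && j < i))   /* stable tie-break for duplicates */
                pos++;
        sorted[pos] = a[i];                               /* each position is written exactly once */
    }

    for (int i = 0; i < N; i++) printf("%d ", sorted[i]);
    printf("\n");
    return 0;
}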

Usage

gcc -fopenmp count_sort_parallel.c -o cs_p
Windows
cs_p
Linux
./cs_p

Execution Time

Both the sequential and parallel versions were tested with a random array of size 90000.

[execution time chart]

insertion_sort

Parallelized insertion sort algorithm, using omp parallel and a critical block (lock). Change the N macro inside the code. The sequential calculation is actually faster.

Usage

gcc -fopenmp insertion_sort_parallel.c -o is_p
Windows
is_p
Linux
./is_p

Execution Time

Both the sequential and parallel versions were tested with a random array of size 500000.

[execution time chart]

As you can see, there is no real difference in time. This is most likely due to the critical block: each thread has to wait its turn to acquire the lock.

epsilon

Parallelized calculation of epsilon (in mathematics), using omp parallel with a reduction clause.

Usage

gcc -fopenmp epsilon_parallel.c -o epsilon_p
Windows
epsilon_p
Linux
./epsilon_p

Execution Time

[execution time chart]

As you can see, there is no real difference in time; the sequential calculation is actually faster.

pi

pi_parallel_array

Every thread writes to a different element of a global array. When all threads are finished (barrier), the elements' values are summed into a single variable.
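A sketch of this approach, assuming the usual midpoint-rule integration of 4/(1+x^2) over [0,1] (the repository's integrand and step count may differ):

/* Hypothetical sketch: each thread accumulates into its own slot of a shared
   array; the slots are summed after the implicit barrier at the end of the
   parallel region. Compile with: gcc -fopenmp pi_array_sketch.c */
#include <stdio.h>
#include <omp.h>

#define STEPS 10000000
#define MAX_THREADS 64

int main(void) {
    double partial[MAX_THREADS] = {0.0};
    double step = 1.0 / STEPS, pi = 0.0;
    int nthreads = 0;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        #pragma omp single
        nthreads = nt;

        double sum = 0.0;
        for (int i = id; i < STEPS; i += nt) {     /* interleaved split of the iterations */
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        partial[id] = sum;                         /* each thread writes its own slot */
    }                                              /* implicit barrier: all slots are ready */

    for (int i = 0; i < nthreads; i++)
        pi += partial[i] * step;
    printf("pi ~= %.10f\n", pi);
    return 0;
}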

Usage
gcc -fopenmp pi_parallel_array.c -o pi_parallel_array
Windows
pi_parallel_array
Linux
./pi_parallel_array

pi_parallel_atomic

Each thread writes to a global variable (protected with an atomic operation).

Usage
gcc -fopenmp pi_parallel_atomic.c -o pi_parallel_atomic
Windows
pi_parallel_atomic
Linux
./pi_parallel_atomic

pi_parallel_critical

Each thread writes to a global variable (protected with a critical section).

Usage
gcc -fopenmp pi_parallel_critical.c -o pi_parallel_critical
Windows
pi_parallel_critical
Linux
./pi_parallel_critical

pi_parallel_local

Each thread accumulates into a local variable and then adds its result to the global variable (pi).

Usage
gcc -fopenmp pi_parallel_local.c -o pi_parallel_local
Windows
pi_parallel_local
Linux
./pi_parallel_local

pi_parallel_reduction

Each thread writes to a global variable (combined with OpenMP's reduction clause).
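The same integration expressed with OpenMP's reduction clause (a sketch, under the same integrand assumption as the array version above):

/* Hypothetical sketch: the reduction clause gives every thread a private
   copy of sum and combines the copies at the end of the loop.
   Compile with: gcc -fopenmp pi_reduction_sketch.c */
#include <stdio.h>

#define STEPS 10000000

int main(void) {
    double step = 1.0 / STEPS, sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < STEPS; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.10f\n", sum * step);
    return 0;
}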

Usage
gcc -fopenmp pi_parallel_reduction.c -o pi_parallel_reduction
Windows
pi_parallel_reduction
Linux
./pi_parallel_reduction

Execution Time

[execution time chart]

As we can observe, reduction performs better than the other methods (pi_parallel_local is a hand-written reduction, while pi_parallel_reduction uses OpenMP's reduction clause).

MPI

C

char_freq-mpi

SPMD: each process takes a specific slice of the character file and stores the character frequencies in a local array (each process has its own frequency array).

Using the MPI_Reduce function (with the MPI_SUM operation), all local frequency arrays are summed into the root's (process 0) total frequency array.

/* Sum every process's freq[] into total_freq[] on the root */
MPI_Reduce(freq, total_freq, N, MPI_INT, MPI_SUM, ROOT, MPI_COMM_WORLD);

[MPI_Reduce illustration]

Source: https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/
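A self-contained sketch of this SPMD structure (illustrative: the file is replaced by an in-memory string, and the slicing logic is assumed, not copied from the repository):

/* Hypothetical sketch: each process counts characters in its own slice of
   the buffer, then MPI_Reduce sums the local arrays at the root.
   Compile with mpicc, run with mpirun. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define N 128
#define ROOT 0

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const char *text = "replace with the contents of the input txt file";
    int len = (int)strlen(text);
    int start = rank * len / size;             /* integer division splits the buffer */
    int stop  = (rank + 1) * len / size;

    int freq[N] = {0}, total_freq[N] = {0};
    for (int i = start; i < stop; i++)
        freq[(unsigned char)text[i] % N]++;

    /* Sum all local frequency arrays into the root's total_freq */
    MPI_Reduce(freq, total_freq, N, MPI_INT, MPI_SUM, ROOT, MPI_COMM_WORLD);

    if (rank == ROOT)
        printf("'e' appears %d times\n", total_freq['e']);

    MPI_Finalize();
    return 0;
}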

Usage
mpicc char_freq_parallel_Reduce.c -o char_freq
mpirun -np 4 char_freq // 4 cores

Execution Time

[execution time chart]

count_sort-mpi

Reduce

SPMD: each process is aware of the whole array and sorts a specific part of it (from start to stop). Using the MPI_Reduce function (with the MPI_SUM operation), every process's partial result is combined into sorted_array.
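A sketch of why MPI_SUM works here (illustrative; the broadcast and array size are assumptions, not the repository's exact code): every rank writes its elements into otherwise-zero slots of a local result array, and since each position is filled by exactly one rank, summing the arrays element-wise merges the non-overlapping slices.

/* Hypothetical sketch: MPI enumeration sort merged with MPI_Reduce/MPI_SUM. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 16            /* the real code uses a much larger N */
#define ROOT 0

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int a[N];
    if (rank == ROOT)
        for (int i = 0; i < N; i++) a[i] = rand() % 100;
    MPI_Bcast(a, N, MPI_INT, ROOT, MPI_COMM_WORLD);      /* every process sees the whole array */

    int start = rank * N / size;
    int stop  = (rank + 1) * N / size;

    int local_sorted[N] = {0}, sorted[N] = {0};
    for (int i = start; i < stop; i++) {
        int pos = 0;
        for (int j = 0; j < N; j++)
            if (a[j] < a[i] || (a[j] == a[i] && j < i)) pos++;   /* stable tie-break */
        local_sorted[pos] = a[i];            /* slots outside this slice stay zero */
    }

    MPI_Reduce(local_sorted, sorted, N, MPI_INT, MPI_SUM, ROOT, MPI_COMM_WORLD);

    if (rank == ROOT) {
        for (int i = 0; i < N; i++) printf("%d ", sorted[i]);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}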

Usage
mpicc count_sort_parallel_Reduce.c -o cs_r
mpirun -np 4 cs_r // 4 cores
Locations

Each process has a local array containing, for each element i (i = 0, 1, ..., N), the index where that element should be placed so that the final array is sorted. For example, if local_locations[2] == 0, the element at position 2 of the initial array should be placed at position 0.

Usage
mpicc count_sort_parallel_Locations.c -o cs_l
mpirun -np 4 cs_l // 4 cores

Execution Time

All tests with N = 400000.

[execution time chart]

insertion_sort-mpi

Parallelized insertion sort algorithm using a pipeline. N is fixed and equal to the number of processes.
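A sketch of one classic pipeline formulation (an assumption about the structure, not necessarily the repository's exact scheme): rank 0 injects one value per process into the pipeline, and every rank keeps the smallest value it has seen while forwarding the rest, so rank i ends up holding the i-th smallest value.

/* Hypothetical sketch: pipeline insertion sort with one value per process.
   Run with at least 2 processes, e.g. mpirun -np 4. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = -1;                        /* the value this rank keeps (-1 = none yet) */
    int incoming = size - rank;           /* how many values pass through this rank */

    for (int k = 0; k < incoming; k++) {
        int v;
        if (rank == 0)
            v = rand() % 100;             /* rank 0 generates the input stream */
        else
            MPI_Recv(&v, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (mine < 0) {
            mine = v;                     /* first value seen: keep it */
        } else {
            if (v < mine) {               /* keep the smaller, forward the larger */
                int tmp = mine; mine = v; v = tmp;
            }
            MPI_Send(&v, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        }
    }

    printf("rank %d holds %d\n", rank, mine);   /* rank i holds the i-th smallest */
    MPI_Finalize();
    return 0;
}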

Usage

mpicc insertion_sort_parallel_Pipeline.c -o is
mpirun -np 4 is // 4 cores

shell_sort

Parallelized shell sort algorithm, following the method from the Department of Computer Science and Engineering, University at Buffalo.

[parallel shell sort diagram]

Usage
mpicc shell_sort_parallel_Reduce.c -o shell
mpirun -np 4 shell // 4 cores
Execution Time

[execution time chart]

sieve

Parallelized Sieve of Eratosthenes algorithm for finding the prime numbers in a given range. Implemented with a pipeline (two alternatives; the first is slow, and for large N it effectively never finishes), a global bitmap-like array, and the MPI_Scan function.
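A sketch of the pipeline idea (illustrative, not the repository's exact code): rank 0 streams the range into the pipeline, and every other rank adopts the first number it receives as a prime and filters out that prime's multiples. With a fixed number of ranks only the first size-1 primes act as filters, which is one reason this formulation scales poorly.

/* Hypothetical sketch: pipelined sieve. Run with at least 2 processes,
   e.g. mpirun -np 4; the last rank prints whatever survives its filters. */
#include <stdio.h>
#include <mpi.h>

#define LIMIT 50        /* upper bound of the range, kept small for illustration */
#define DONE  -1        /* end-of-stream marker */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                 /* generator */
        for (int n = 2; n <= LIMIT; n++)
            MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        int done = DONE;
        MPI_Send(&done, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {                                         /* filter */
        int prime = 0, v;
        while (1) {
            MPI_Recv(&v, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (v == DONE) break;
            if (prime == 0) {
                prime = v;                           /* first survivor is prime */
                printf("rank %d: prime %d\n", rank, prime);
            } else if (v % prime != 0) {
                if (rank + 1 < size)
                    MPI_Send(&v, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                else
                    printf("rank %d: unfiltered survivor %d\n", rank, v);
            }
        }
        if (rank + 1 < size) {
            int done = DONE;
            MPI_Send(&done, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}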

Pipeline (Alternative 1)
Usage
mpicc sieve_parallel_Pipeline.c -o sieve_pipe
mpirun -np 4 sieve_pipe // 4 cores
Pipeline (Alternative 2)
Usage
mpicc sieve_parallel_Pipeline_2.c -o sieve_pipe_2
mpirun -np 4 sieve_pipe_2 // 4 cores
Global Locations
Usage
mpicc sieve_parallel_Locations_Global.c -o sieve_g
mpirun -np 4 sieve_g // 4 cores
MPI Scan
Usage
mpicc sieve_parallel_Locations_MPI_Scan.c -o sieve_scan
mpirun -np 4 sieve_scan // 4 cores
Execution Time

Tested for N = 200 (below).

[execution time chart]

mpi4py

All of the MPI code is also written in Python, using the mpi4py library.

Usage

python <name_of_py_file>