Unit 4

True and False Sharing

TODO

Principles of Parallel Algorithm Design

Granularity

  • Granularity is the task size (the amount of computation per task)
  • Classified by the number of tasks for the same problem size:
    • Fine-grained decomposition - a large number of small tasks
    • Coarse-grained decomposition - a small number of large tasks
  • Granularity for the dense matrix-vector product:
    • fine-grained: each task computes an individual element of y
    • coarser-grained: each task computes 3 elements of y

Degree of Concurrency

  • The number of tasks that can be executed in parallel
  • May change over program execution
  • Metrics:
    • Maximum degree of concurrency - the maximum number of tasks that can run concurrently at any point during execution
    • Average degree of concurrency
      • The average number of tasks that can be processed in parallel over the execution of the program
      • Speedup: serial_execution_time/parallel_execution_time
  • Degree of concurrency and task granularity are inversely related: finer-grained tasks give a higher degree of concurrency

Critical Path Length

  • Critical path - the longest weighted path through the task-dependency graph
  • Critical path length - the shortest time in which the program can be executed in parallel (see the worked example below)
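
As a small worked example (numbers assumed for illustration, not from the slides): for a task-dependency graph of 15 unit-time tasks whose longest chain of dependent tasks contains 5 tasks,

$$\text{critical path length} = 5, \qquad \text{average degree of concurrency} = \frac{\text{total work}}{\text{critical path length}} = \frac{15}{5} = 3$$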

Limits of Parallel Performance

  • What bounds parallel execution time?
    • minimum task granularity
      • e.g. dense matrix-vector multiplication has at most n² concurrent tasks
    • fraction of application work that can’t be parallelized
    • dependencies between tasks
    • parallelization overheads
      • e.g., cost of communication between tasks
  • Measures of parallel performance (see the worked example below)
    • speedup = T1 / Tp
    • parallel efficiency = T1 / (p × Tp)
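
As a small worked example (numbers assumed for illustration): if a program takes T1 = 100 s on one processor and Tp = 25 s on p = 8 processors, then

$$\text{speedup} = \frac{T_1}{T_p} = \frac{100}{25} = 4, \qquad \text{parallel efficiency} = \frac{T_1}{p\,T_p} = \frac{100}{8 \times 25} = 0.5$$

The efficiency of 0.5 reflects that half of the aggregate processor time is lost to the overheads and limits listed above.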

OpenMP API

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

  • Shared Memory Model

  • Directive Based

  • Easier to program & debug

  • Supported by gcc 4.2 and higher

  • Open specifications for Multi Processing, based on a thread paradigm.

  • Introduced in 1997 by OpenMP Architecture Review Board

    • v1.0 - 1997, v2.5 - 2005, v3.0 - 2008, v4.0 - 2013, v5.0 - Nov 2018
    • v5.1 released in Nov 2020, with enhancements for interoperability with non-OpenMP device runtimes such as CUDA streams, HIP, or OpenCL
  • 3 primary components (Compiler Directives, Runtime Library Routines, Environment Variables).

    • Extensions for Fortran, C, C++

OMP Programming Model

  • Shared Memory, Thread Based Parallelism.
  • Explicit Parallelism.
  • Fork-Join Model
    • Execution starts with a single thread - the master thread.
    • On entering a parallel region, new threads are forked - a team of threads.
    • Threads join back together at the end of the region - only the master thread continues.

OMP Programming Model: OpenMP is a shared-memory model.

  • Workload is distributed between threads
  • Variables can be
    • shared among all threads
    • duplicated for each thread
  • Threads communicate by sharing variables
  • Unintended sharing of data can lead to race conditions:
    • race condition: the program’s outcome changes depending on how the threads are scheduled.
  • To control race conditions:
    • use synchronization to protect data conflicts (see the sketch below)
  • Careless use of synchronization can lead to deadlocks
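
As a minimal sketch of protecting a data conflict with synchronization (the shared counter and the critical construct below are illustrative, not from the slides):

#include <stdio.h>
#include <omp.h>

int main() {
    int count = 0;                  // shared by all threads
    #pragma omp parallel shared(count)
    {
        #pragma omp critical        // only one thread at a time may execute this
        count++;                    // protected update: no increments are lost
    }
    printf("count = %d\n", count);  // equals the number of threads in the team
    return 0;
}

Without the critical directive, the read-modify-write of count could interleave across threads and increments would be lost.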

OpenMP Directives (extra, not from syllabus)

Source: https://docs.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-directives?view=msvc-170

Runtime Library

int omp_get_num_threads(void);
// Returns the number of threads currently in the team executing the parallel region from which it is called

int omp_get_thread_num(void);
// Returns the calling thread's number within the team, between 0 and omp_get_num_threads() - 1, inclusive.
// The master thread of the team is thread 0

void omp_set_num_threads(int);
// Sets the number of threads to use for subsequent parallel regions
// Alternatively, set the environment variable OMP_NUM_THREADS

Boilerplate Code

#include <stdio.h>
#include <omp.h>

int main() {
    int i, numt, id;
    #pragma omp parallel private(id, i) shared(numt)
    {
        numt = omp_get_num_threads();   // written by every thread with the same value
        id = omp_get_thread_num();      // private: each thread keeps its own id
        for (i = 0; i < 100; i++);      // dummy delay loop (i is private here)

        printf("Output from thread %d of %d\n", id, numt);
    }
}

Building and running

$ g++ -fopenmp src/main.c
$ ./a.out
Output from thread 0 of 12
Output from thread 6 of 12
Output from thread 8 of 12
Output from thread 9 of 12
Output from thread 3 of 12
Output from thread 1 of 12
Output from thread 2 of 12
Output from thread 11 of 12
Output from thread 7 of 12
Output from thread 4 of 12
Output from thread 10 of 12
Output from thread 5 of 12

Modifying default OMP_NUM_THREADS

$ OMP_NUM_THREADS=2 ./a.out # Modify the environment variable for the program
Output from thread 0 of 2
Output from thread 1 of 2
$ export OMP_NUM_THREADS=2 # OR change it globally

Code Examples

Example 0

#include <stdio.h>
#include <omp.h>
int main()
{
    #pragma omp parallel
    {
        printf("Hello World... from thread = %d\n", omp_get_thread_num());
    }
}

Output

$ OMP_NUM_THREADS=3 make target=ex0
Hello World... from thread = 0
Hello World... from thread = 1
Hello World... from thread = 2

Example 1

#include <stdio.h>
#include <omp.h>
int main()
{
    omp_set_num_threads(6); // Set number of threads
    #pragma omp parallel
    {
        printf("Hello World... from thread = %d\n", omp_get_thread_num());
    }
}

Output

$ OMP_NUM_THREADS=3 make target=ex1 # Even though OMP_NUM_THREADS is set, the omp_set_num_threads overrides it
Hello World... from thread = 0
Hello World... from thread = 2
Hello World... from thread = 3
Hello World... from thread = 1
Hello World... from thread = 5
Hello World... from thread = 4

Example 3

#include <stdio.h>
#include <omp.h>
int main()
{
    int i, numt, id;
    #pragma omp parallel
    {
        numt = omp_get_num_threads();
        id = omp_get_thread_num();
        for (i=0; i<100; i++);   // note: i, numt and id are all shared here
        printf(" Output from thread %d of %d\n", id, numt);
    }
}

Output

$ OMP_NUM_THREADS=3 make target=ex3 # 2 of 3 appears twice, race condition
 Output from thread 2 of 3
 Output from thread 0 of 3
 Output from thread 2 of 3

Example 3.1

int main()
{
    int i, numt, id;
    #pragma omp parallel private (id) shared (numt) // Declare as private / shared
    {
        // ...
    }
}

Output

$ OMP_NUM_THREADS=3 make target=ex3.1 # Consistently gives correct output, no race condition
 Output from thread 0 of 3
 Output from thread 1 of 3
 Output from thread 2 of 3

Example 4 - For loops

Note that the following two are equivalent.

Source: https://stackoverflow.com/questions/1448318/omp-parallel-vs-omp-parallel-for

#pragma omp parallel
{ 
    #pragma omp for
    for(int i = 1; i < 100; ++i)
    {
        ...
    }
}

#pragma omp parallel for
for(int i = 1; i < 100; ++i)
{
   ...
}

Distributing for iterations over the available threads

Consider the following code example (ex4 from slides). Observing the code, we can infer the following:

  • i runs from 0 to 1 (2 iterations)
  • j runs from 0 to 3 (4 iterations) for each value of i
  • printf will be called a total of 2*4 = 8 times
  • Total of 8 'tasks' to be computed over OMP_NUM_THREADS number of threads
  • To be precise, there are 2 sequential outer iterations (i), each with 4 sub-tasks (j) that can run in parallel. Hence, the program can use at most 4 threads at a time
#include <stdio.h>
#include <omp.h>
 
int main() {
    int i=0, j=0, id;
    #pragma omp parallel private (id,i,j)         
    {                                    
        for (i=0; i<2; i++) {  
            #pragma omp for        
            for (j=0; j<4; j++) {                
                id = omp_get_thread_num();
                printf("[Thread_%d] (i,j)=(%d,%d)\n", id, i,j);
            }
        }            
    }
}

4 Threads

This is the output with 4 threads

$ OMP_NUM_THREADS=4 make target=ex4
[Thread_0] (i,j)=(0,0)
[Thread_1] (i,j)=(0,1)
[Thread_3] (i,j)=(0,3)
[Thread_2] (i,j)=(0,2)
[Thread_0] (i,j)=(1,0)
[Thread_3] (i,j)=(1,3)
[Thread_2] (i,j)=(1,2)
[Thread_1] (i,j)=(1,1)

We see the following mapping:

  • Thread_0 <= [ (0,0), (1,0) ]
  • Thread_1 <= [ (0,1), (1,1) ]
  • Thread_2 <= [ (0,2), (1,2) ]
  • Thread_3 <= [ (0,3), (1,3) ]

2 Threads

This is the output with 2 threads

$ OMP_NUM_THREADS=2 make target=ex4
[Thread_0] (i,j)=(0,0)
[Thread_0] (i,j)=(0,1)
[Thread_1] (i,j)=(0,2)
[Thread_1] (i,j)=(0,3)
[Thread_0] (i,j)=(1,0)
[Thread_0] (i,j)=(1,1)
[Thread_1] (i,j)=(1,2)
[Thread_1] (i,j)=(1,3)

We see the following mapping:

  • Thread_0 <= [ (0,0), (0,1), (1,0), (1,1) ]
  • Thread_1 <= [ (0,2), (0,3), (1,2), (1,3) ]

8 Threads

This is the output with 8 threads

$ OMP_NUM_THREADS=8 make target=ex4
[Thread_0] (i,j)=(0,0)
[Thread_1] (i,j)=(0,1)
[Thread_3] (i,j)=(0,3)
[Thread_2] (i,j)=(0,2)
[Thread_0] (i,j)=(1,0)
[Thread_3] (i,j)=(1,3)
[Thread_2] (i,j)=(1,2)
[Thread_1] (i,j)=(1,1)

We see the following mapping:

  • Thread_0 <= [ (0,0), (1,0) ]
  • Thread_1 <= [ (0,1), (1,1) ]
  • Thread_2 <= [ (0,2), (1,2) ]
  • Thread_3 <= [ (0,3), (1,3) ]
  • Thread_4 <= [ ]
  • Thread_5 <= [ ]
  • Thread_6 <= [ ]
  • Thread_7 <= [ ]

Note that even though we provide 8 threads, only 4 are utilised. The inner omp for loop has only 4 iterations (j = 0..3) to distribute, so at most 4 threads receive work; the outer i loop is executed sequentially by every thread.

Example 5

#include <stdio.h>
#include <omp.h>
 
int main() {
    int i=0, j=0, id;
    #pragma omp parallel private (id,i,j)         
    {                          
        #pragma omp for                  
        for (i=0; i<4; i++) {  
            for (j=0; j<4; j++) {                
                id = omp_get_thread_num();
                printf("[Thread_%d] (i,j)=(%d,%d)\n", id, i,j);
            }
        }            
    }
}

Output

$ OMP_NUM_THREADS=4 make target=ex5
[Thread_0] (i,j)=(0,0)
[Thread_0] (i,j)=(0,1)
[Thread_0] (i,j)=(0,2)
[Thread_0] (i,j)=(0,3)
[Thread_1] (i,j)=(1,0)
[Thread_1] (i,j)=(1,1)
[Thread_2] (i,j)=(2,0)
[Thread_2] (i,j)=(2,1)
[Thread_2] (i,j)=(2,2)
[Thread_2] (i,j)=(2,3)
[Thread_3] (i,j)=(3,0)
[Thread_3] (i,j)=(3,1)
[Thread_3] (i,j)=(3,2)
[Thread_3] (i,j)=(3,3)
[Thread_1] (i,j)=(1,2)
[Thread_1] (i,j)=(1,3)

We see the following mapping:

  • Thread_0 <= [ (0,0), (0,1), (0,2), (0,3) ]
  • Thread_1 <= [ (1,0), (1,1), (1,2), (1,3) ]
  • Thread_2 <= [ (2,0), (2,1), (2,2), (2,3) ]
  • Thread_3 <= [ (3,0), (3,1), (3,2), (3,3) ]

2D Array Access

// Array is accessed row-wise. Gives good performance,
// as the elements of a row sit in the same cache line.
for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
        sum += a[i][j];

// Array is accessed column-wise. Gives bad performance,
// as consecutive elements of a column sit in different
// cache lines, resulting in cache misses.
for (int j=0; j<n; j++)
    for (int i=0; i<n; i++)
        sum += a[i][j];
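
The effect can be measured directly. The sketch below (not from the slides; the array size N is arbitrary) times both traversal orders with omp_get_wtime():

#include <stdio.h>
#include <omp.h>

#define N 2048

static double a[N][N];              // zero-initialised global array

int main() {
    double sum = 0.0, t;

    // Row-wise traversal: the inner loop walks along a cache line.
    t = omp_get_wtime();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-wise:    %.4f s (sum=%g)\n", omp_get_wtime() - t, sum);

    // Column-wise traversal: the inner loop jumps N doubles each step.
    sum = 0.0;
    t = omp_get_wtime();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-wise: %.4f s (sum=%g)\n", omp_get_wtime() - t, sum);
    return 0;
}

Compile with -fopenmp (needed for omp_get_wtime). Once the array no longer fits in cache, the column-wise loop is typically several times slower; the exact gap is machine-dependent.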

Loop Interchange

Here, a nested loop accesses the array column-wise. In this case, interchanging the loops improves cache utilisation.

// The inner loop varies the row index j, so b[j][i] and
// a[j][i] are accessed column-wise (stride n).
for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
        b[j][i] = a[j][i];

// After loop interchange: the inner loop varies i, so both
// arrays are accessed row-wise (unit stride).
for (int j=0; j<n; j++)
    for (int i=0; i<n; i++)
        b[j][i] = a[j][i];

Loop Unrolling

// Loop overheads are relatively high when each iteration
// has a small number of operations.
// (The loop starts at i=1 so that a[i-1] and b[i-1] are defined.)
for (int i=1; i<n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] + a[i-1] + b[i-1];
}

// The loop has been unrolled by a factor of 2 to reduce overheads.
// We assume the number of iterations is divisible by 2.
for (int i=1; i<n; i+=2) {
    a[i] = b[i] + 1;
    c[i] = a[i] + a[i-1] + b[i-1];
    a[i+1] = b[i+1] + 1;
    c[i+1] = a[i+1] + a[i] + b[i];
}

Loop Fusion

/* A pair of loops that both access array a.
The second loop reuses element a[i], but by
the time it is executed, the cache line this
element belongs to may no longer be in the
cache. */
for (int i=0; i<n; i++)
    a[i] = b[i] * 2;
for (int i=0; i<n; i++) {
    x[i] = x[i] * 2;
    c[i] = a[i] + 2;
}

/* The pair of loops has been combined and the
statements reordered. This allows the values
of a to be reused immediately, while they are
still in the cache. */
for (int i=0; i<n; i++) {
    a[i] = b[i] * 2;
    x[i] = x[i] * 2;
    c[i] = a[i] + 2;
}

Loop Fission

/* A loop with poor cache utilization gives poor performance.
If we split the update of array c from the rest of the work,
loop interchange can then be applied to the remaining loop
nest to improve cache hits. */
for (int i=0; i<n; i++) {
    c[i] = exp((double)i/n);   // cast avoids integer division
    for (int j=0; j<n; j++)
        a[j][i] = b[j][i] + d[j] * e[i];
}

// After loop fission, the two pieces are separate loops:
for (int i=0; i<n; i++)
    c[i] = exp((double)i/n);
for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
        a[j][i] = b[j][i] + d[j] * e[i];

Loop Tiling

/* Here, a nested loop performs an array transpose.
In this case, loop interchange will not improve
the use of the cache or TLB; a different approach is required. */
for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
        b[i][j] = a[j][i];

/* How to reduce cache misses? Here we use tiling:
the j loop is split into a pair of loops (j1 over tiles of
size nbj, j2 within a tile). This reduces cache misses.
MIN(a,b) is assumed to be defined, e.g. ((a)<(b)?(a):(b)). */
for (int j1=0; j1<n; j1+=nbj)
    for (int i=0; i<n; i++)
        for (int j2=0; j2<MIN(n-j1,nbj); j2++)
            b[i][j1+j2] = a[j1+j2][i];

OpenMP Shared and Private

default(shared|none)

  • Allows the user to specify the default scope for all of the variables in the parallel region.
  • If this is not specified at the beginning of the parallel region, the default is automatically shared.
  • Specific variables can be exempted from the default by using one of the other clauses.
#pragma omp parallel default(private), shared(a,b)
  • If you specify DEFAULT(NONE), you are required to explicitly specify the data-sharing attribute of every variable used inside the parallel region.

shared(list)

  • Most variables are shared. A shared variable exists in only one memory location, and all threads can read or write that address.
  • Generally, read-only variables (which cannot accidentally be overwritten) and the main arrays are shared.
#pragma omp parallel default(none), shared(a,b)
  • Here, a and b are the only variables in this region; because DEFAULT(NONE) was specified first, all variables must be explicitly scoped.

private(list)

  • A variable that is private can only be read or written by its own thread.
  • When a variable is declared as private, a new object of the same type is declared once for each thread and all references to the original object are replaced with references to the new object.
  • By default, loop indices are private and loop temporaries are private.
#pragma omp parallel default(none), shared(a,b), private(i,j)
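
A complete compilable sketch combining the three clauses (the variables a, b and i are illustrative, not from the slides); with default(none), every variable referenced in the region must be scoped explicitly:

#include <stdio.h>
#include <omp.h>

int main() {
    int a = 10, b = 20;     // read-only inside the region: safe to share
    int i;                  // written by every thread: must be private
    #pragma omp parallel default(none) shared(a,b) private(i)
    {
        i = omp_get_thread_num();            // each thread has its own copy of i
        printf("thread %d sees a+b = %d\n", i, a + b);
    }
    return 0;
}

Removing i from the private list (or a and b from the shared list) makes the compiler reject the region, because default(none) forbids implicit scoping.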

Race Condition

How do we avoid numt being written by every thread, when it only needs to be set once?

#include <stdio.h>
#include <omp.h>
int main()
{
    int i, numt, id;
    #pragma omp parallel private(id) shared(numt)
    {
        id = omp_get_thread_num();
        if (id == 0)                          // only thread 0 sets numt ...
            numt = omp_get_num_threads();     // ... but nothing stops other threads reading it first
        for (i=0; i<100; i++);
        printf(" Race condition from thread %d of %d\n\n", id, numt);
    }
    }
}

Output:

$ make target=race_par2
Race condition from thread 1 of 0
Race condition from thread 0 of 4
Race condition from thread 2 of 0
Race condition from thread 3 of 4
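
One way to remove the race (a sketch, not from the slides) is to let a single thread set numt and rely on the implicit barrier at the end of the single construct, so that no thread prints before numt has been written:

#include <stdio.h>
#include <omp.h>

int main() {
    int numt, id;
    #pragma omp parallel private(id) shared(numt)
    {
        id = omp_get_thread_num();
        #pragma omp single
        numt = omp_get_num_threads();   // executed by exactly one thread; implicit barrier follows
        printf(" No race from thread %d of %d\n", id, numt);
    }
    return 0;
}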