spm-18: exercises and final project

Homework and final project of the Parallel and Distributed Systems: Paradigms and Models course of the Computer Science and Networking Master's Degree @ University of Pisa.

Homework 1 (C++)

A program that, given a stream of std::vector<double> elements, applies two functions (f1(x) = x + 1 and f2(x) = 2 * x) to all the items of each vector in the stream. The implementation uses only C++ base mechanisms and libraries.

The computation has been implemented both sequentially and in parallel. The parallel version is a four-stage pipeline on a shared-memory multicore: STAGE1 generates a stream of m vectors of n items each (randomly filled), STAGE2 increments all the items of each input vector, STAGE3 doubles all the items of each input vector, and STAGE4 prints the contents of the input vectors on screen. Performance has been measured in terms of scalability, speedup and efficiency.
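
A minimal sketch of this pipeline structure is shown below (illustrative code, not the homework solution), assuming a simple mutex/condition-variable queue between stages and an empty vector as end-of-stream marker:

```cpp
// Four-stage pipeline sketch: each stage is a std::thread, stages
// communicate through blocking queues, an empty vector marks end-of-stream.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <random>
#include <thread>
#include <vector>

template <typename T>
class SyncQueue {                       // blocking FIFO channel between stages
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !q.empty(); });
        T v = std::move(q.front()); q.pop();
        return v;
    }
};

int main() {
    const int m = 8, n = 4;             // m vectors of n items each (assumed sizes)
    SyncQueue<std::vector<double>> q1, q2, q3;

    std::thread stage1([&]{             // STAGE1: generate the stream
        std::mt19937 gen(42);
        std::uniform_real_distribution<double> dist(0.0, 100.0);
        for (int i = 0; i < m; ++i) {
            std::vector<double> v(n);
            for (auto& x : v) x = dist(gen);
            q1.push(std::move(v));
        }
        q1.push({});                    // end-of-stream marker
    });
    std::thread stage2([&]{             // STAGE2: f1(x) = x + 1
        for (;;) {
            auto v = q1.pop();
            if (v.empty()) { q2.push({}); break; }
            for (auto& x : v) x += 1.0;
            q2.push(std::move(v));
        }
    });
    std::thread stage3([&]{             // STAGE3: f2(x) = 2 * x
        for (;;) {
            auto v = q2.pop();
            if (v.empty()) { q3.push({}); break; }
            for (auto& x : v) x *= 2.0;
            q3.push(std::move(v));
        }
    });
    std::thread stage4([&]{             // STAGE4: print the results
        for (;;) {
            auto v = q3.pop();
            if (v.empty()) break;
            for (double x : v) std::cout << x << ' ';
            std::cout << '\n';
        }
    });

    stage1.join(); stage2.join(); stage3.join(); stage4.join();
}
```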

Homework 2 (C++)

A program that computes a set of independent tasks in parallel: the tasks are initially stored in a shared data structure and the results are delivered through a second shared data structure. The program has been implemented using only C++ standard mechanisms and threads. An input task is an integer N, and the result to be computed is the number of prime numbers in the range [1, N]. The initial set of tasks is drawn randomly from the range [1, 10K].

Two different implementations of the workers have been provided:
  • SEQUENTIAL: a worker first checks the collection of results to see whether its current task has already been computed by another worker. If so, its computation is done; otherwise it looks for the biggest task computed so far that is smaller than its own task and counts the primes in a sub-interval of [1, N] only.
    Example: worker 0 receives task 100, so it has to compute the number of prime numbers in the interval [1, 100]. It finds out that task 100 has not been computed yet and looks for the biggest already-computed task smaller than 100 (assume it is 93). Since task 93 has already been computed, it only counts the prime numbers in the range [94, 100], and its final result is the sum of the number of primes in [1, 93] and the number of primes in [94, 100].
  • SEQUENTIAL OPTIMIZED: each worker runs the Sieve of Eratosthenes to count the prime numbers between 1 and the value of its received task.
An artificial delay can be added to the procedure that retrieves a task, in order to observe its impact on scalability.
Load balancing among the workers has been implemented as well.
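
The two counting strategies can be sketched as follows (hypothetical helper functions, not the homework code):

```cpp
// Counting primes by trial division over a sub-interval (the incremental
// SEQUENTIAL strategy) and by the Sieve of Eratosthenes (SEQUENTIAL OPTIMIZED).
#include <cstddef>
#include <iostream>
#include <vector>

static bool is_prime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// SEQUENTIAL strategy: count primes in [lo, hi] only, so the result of an
// already-computed smaller task can be reused.
std::size_t count_primes_in(std::size_t lo, std::size_t hi) {
    std::size_t count = 0;
    for (std::size_t n = lo; n <= hi; ++n)
        if (is_prime(n)) ++count;
    return count;
}

// SEQUENTIAL OPTIMIZED strategy: Sieve of Eratosthenes over [1, n].
std::size_t count_primes_sieve(std::size_t n) {
    if (n < 2) return 0;
    std::vector<bool> composite(n + 1, false);
    std::size_t count = 0;
    for (std::size_t i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        ++count;
        for (std::size_t j = i * i; j <= n; j += i) composite[j] = true;
    }
    return count;
}

int main() {
    // Reusing task 93 to compute task 100, as in the example above.
    std::cout << count_primes_in(1, 93) + count_primes_in(94, 100) << '\n';  // 25
    std::cout << count_primes_sieve(100) << '\n';                            // 25
}
```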

Homework 3 (C++, FastFlow)

A program that finds the prime numbers in a range [n1, n2]. Four implementations have been provided:
  • SEQUENTIAL: sequential computation of the prime numbers in the interval [n1, n2].
  • PARALLEL (MASTER-WORKER v1): master-worker parallel schema (farm + feedback) where the task granularity is a single number in the range [n1, n2]: the emitter distributes the tasks to the workers, each worker checks whether the received number is prime and, if so, sends it back to the emitter.
  • PARALLEL (MASTER-WORKER v2): master-worker parallel schema (farm + feedback) where the task granularity is a sub-interval of the range [n1, n2]: the emitter distributes the tasks to the workers, each worker works sequentially on the received sub-interval and sends back to the emitter the list of prime numbers found in it.
  • PARALLEL (PARALLEL FOR): parallel-for pattern where the initial range [n1, n2] is partitioned into chunks, each one computed by a worker.
The parallel versions are implemented using FastFlow.
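
As an illustration, the PARALLEL FOR version can be sketched with FastFlow's ParallelFor roughly as follows (illustrative code, not the homework solution; the exact ParallelFor signature may vary slightly between FastFlow versions, and building requires the FastFlow headers on the include path):

```cpp
// Parallel-for sketch: the range [n1, n2] is split among the workers and
// each index is tested for primality; only the count is kept here, since
// collecting the primes themselves would need extra synchronization.
#include <atomic>
#include <iostream>
#include <ff/parallel_for.hpp>

static bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int main() {
    const long n1 = 1, n2 = 100000;
    const int  nworkers = 8;             // assumed worker count
    std::atomic<long> found{0};

    ff::ParallelFor pf(nworkers);
    pf.parallel_for(n1, n2 + 1, [&](const long i) {
        if (is_prime(i)) ++found;
    }, nworkers);

    std::cout << "primes in [" << n1 << ", " << n2 << "]: " << found << '\n';
}
```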

Final project

The application implements the parallel scan algorithm developed by Guy E. Blelloch. The schema adopted in the parallel implementation is the master-worker one, where a distributor module acts as both scatter and gather, and several worker modules perform the computation in parallel. A performance model has been derived by analyzing the complexity of the computation phases; it is used in the tests to estimate the time needed to complete the computation. Actual performance is measured on a Xeon Phi KNL machine with 64 cores (256 threads) and compared with the values predicted by the model.
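
For reference, the two phases of Blelloch's algorithm can be sketched sequentially as follows (a minimal sketch, assuming an exclusive scan and an input size that is a power of two; the parallel implementation distributes these phases among the workers through the distributor):

```cpp
// Sequential sketch of Blelloch's exclusive scan: up-sweep (reduce) followed
// by down-sweep, shown only to clarify the two phases of the algorithm.
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

template <typename T, typename Op>
void blelloch_scan(std::vector<T>& a, Op op, T identity) {
    const std::size_t n = a.size();
    // Up-sweep (reduce): build partial sums bottom-up in a binary tree.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] = op(a[i - d], a[i]);
    // Down-sweep: push the exclusive prefixes back down the tree.
    a[n - 1] = identity;
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            T left = a[i - d];      // sum of the left subtree
            T pref = a[i];          // exclusive prefix coming from the parent
            a[i - d] = pref;        // left child inherits the parent's prefix
            a[i] = op(pref, left);  // right child: prefix combined with left sum
        }
}

int main() {
    std::vector<int> v{3, 1, 7, 0, 4, 1, 6, 3};
    blelloch_scan(v, std::plus<int>{}, 0);
    for (int x : v) std::cout << x << ' ';   // 0 3 4 11 11 15 16 22
    std::cout << '\n';
}
```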

Implementation

Two implementations are provided: one uses only C++ threads and standard mechanisms, the other exploits low-level FastFlow building blocks. Both share a common structure: a class that, given an input vector, an associative operation and its identity value, computes the output vector containing the result of the scan.
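
A hypothetical shape of that shared interface is sketched below (illustrative names, not the actual project code); compute() is written here as a plain sequential exclusive scan, standing in for the parallel up-sweep/down-sweep of the real implementations:

```cpp
// Hypothetical interface: input vector, associative operation and identity
// go in through the constructor, compute() returns the scanned vector.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename T>
class ParallelPrefix {
public:
    ParallelPrefix(std::vector<T> input,
                   std::function<T(const T&, const T&)> op,
                   T identity)
        : input_(std::move(input)), op_(std::move(op)),
          identity_(std::move(identity)) {}

    std::vector<T> compute() const {
        std::vector<T> out(input_.size());
        T acc = identity_;                  // running prefix
        for (std::size_t i = 0; i < input_.size(); ++i) {
            out[i] = acc;                   // exclusive: everything before element i
            acc = op_(acc, input_[i]);
        }
        return out;
    }

private:
    std::vector<T> input_;
    std::function<T(const T&, const T&)> op_;
    T identity_;
};
```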

Tests

Three tests are provided:

  1. The first test runs the parallel prefix class on a vector of integers, using addition as the associative operation.
  2. The second runs it on a vector of integers, using multiplication as the associative operation.
  3. The third runs it on a vector of strings, using concatenation as the associative operation.
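
The expected semantics of the three operation/identity pairs can be illustrated with the standard library's scan (a reference sketch only, assuming the exclusive variant that Blelloch's algorithm naturally produces; the project's tests may differ in detail). Compile with -std=c++17 or later:

```cpp
// Reference oracle for the three test cases: addition, multiplication and
// string concatenation, each with its identity value.
#include <functional>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

template <typename T, typename Op>
void print_exclusive_scan(const std::vector<T>& in, Op op, T identity) {
    std::vector<T> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), identity, op);
    for (const auto& x : out) std::cout << x << ' ';
    std::cout << '\n';
}

int main() {
    print_exclusive_scan<int>({1, 2, 3, 4}, std::plus<int>{}, 0);        // 0 1 3 6
    print_exclusive_scan<int>({1, 2, 3, 4}, std::multiplies<int>{}, 1);  // 1 1 2 6
    print_exclusive_scan<std::string>({"a", "b", "c", "d"},
                                      std::plus<std::string>{}, "");     // "", "a", "ab", "abc"
}
```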