APPLICATIONS:
=============

------------------------
swps3 (release 20080605)
------------------------

swps3 is an optimised version of the striped Smith-Waterman local alignment
algorithm for both the Cell/BE and x86/x86_64 CPUs with SSE2 instructions.
swps3 has been developed at the Computer Science Department of ETH Zürich.
The project website is: http://www.inf.ethz.ch/personal/sadam/swps3/

We modified the original code in order to use the FastFlow framework.
The new version (swps3_ff) can be compiled this way:

 #cd swps3
 #make -f Makefile.sse2_ff

Command line parameters are:

 swps3_ff [-h] [-s] [-j num] [-i num] [-e num] [-t num] matrix query db

  matrix: scoring matrix file
  query : query amino acid sequence in FASTA format
  db    : amino acid sequence database in FASTA format
  -h    : print help
  -s    : run the scalar version (without vectorised SSE instructions)
  -j num: start num worker threads
  -i num: gap insertion score (default: 12)
  -e num: gap extension score (default: 2)
  -t num: score limit (default: DBL_MAX)

You can find the Swiss-Prot protein DB in FASTA format here:
http://www.uniprot.org/downloads

------------------------
simple_mandelbrot
------------------------

Simple Mandelbrot computes and displays the Mandelbrot set. The application
is meant as a naive example of an embarrassingly parallel application on a
stream. The Mandelbrot set is represented as a matrix of pixels. The
coordinates of the lines of the complex plane are dispatched by a scheduler
thread, which "creates a stream" and sends its items (lines) to the workers
in a round-robin fashion. Each worker allocates a row of pixels, computes
the Mandelbrot set for its line and then sends the row of pixels to a
collector, which gathers the lines from all workers and displays them on
the screen.

The application has been realised in different versions: sequential
(mandel_seq), pthread-based (mandel_pt), FastFlow (mandel_ff), and FastFlow
with FastFlow's memory allocator (mandel_ff_mem_all). All versions use the
same sequential code, which has been manually copied and pasted.

Performance is very dependent on the graphic subsystem, so if you want to
measure performance, please compile all versions with the MANDEL_NO_DISPLAY
environment variable set (see the Makefile). In our tests, the smaller the
number of iterations, the higher the FastFlow performance with respect to
the mandel_pt version.

Command line parameters are:

 mandel_<ver> size niterations retries nworkers [template-type]

  size         : size of the pixel matrix
  niterations  : max number of iterations to compute each point of the set
  retries      : number of times the experiment is run
  nworkers     : number of workers used in the computation (threads = nworkers+1)
  template-type: 0 force not to use the Collector, 1 force to use the Collector

Default values are: 800 1024 1 2 1
Note that with fewer than 4 cores, the template-type parameter is set to 0
if not specified differently on the command line.
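The scheduler/worker/collector structure described above maps quite directly
onto FastFlow's farm skeleton. The following is a minimal sketch, assuming
the classic ff_node/ff_farm interface (add_emitter, add_workers,
add_collector, ff_send_out, GO_ON); the task layout and the mandel_line
helper are illustrative and are not taken from the mandel_ff sources.

  #include <cstdlib>
  #include <vector>
  #include <ff/farm.hpp>
  using namespace ff;

  const int SIZE  = 800;     // pixel matrix is SIZE x SIZE
  const int NITER = 1024;    // max iterations per point

  struct line_task { int y; unsigned char *pixels; };

  // compute one line of the Mandelbrot set (plays the role of the shared
  // sequential code)
  static void mandel_line(int y, unsigned char *row) {
      double im = -1.0 + 2.0 * y / SIZE;
      for (int x = 0; x < SIZE; ++x) {
          double re = -2.0 + 3.0 * x / SIZE, zr = 0.0, zi = 0.0;
          int k = 0;
          while (k < NITER && zr * zr + zi * zi < 4.0) {
              double t = zr * zr - zi * zi + re;
              zi = 2.0 * zr * zi + im; zr = t; ++k;
          }
          row[x] = (unsigned char)(255.0 * k / NITER);
      }
  }

  // scheduler: "creates the stream", one task per line of the complex plane
  struct Scheduler: ff_node {
      void *svc(void *) {
          for (int y = 0; y < SIZE; ++y) {
              line_task *t = new line_task; t->y = y; t->pixels = 0;
              ff_send_out(t);              // dispatched to the workers by the farm
          }
          return NULL;                     // NULL closes the stream
      }
  };

  // worker: allocates a row of pixels and fills it
  struct Worker: ff_node {
      void *svc(void *task) {
          line_task *t = (line_task *)task;
          t->pixels = new unsigned char[SIZE];
          mandel_line(t->y, t->pixels);
          return t;                        // forwarded to the collector
      }
  };

  // collector: gathers the rows as they arrive; the real application
  // displays them here
  struct Collector: ff_node {
      void *svc(void *task) {
          line_task *t = (line_task *)task;
          delete [] t->pixels; delete t;
          return GO_ON;
      }
  };

  int main(int argc, char *argv[]) {
      int nworkers = (argc > 1) ? atoi(argv[1]) : 2;
      ff_farm<> farm;
      farm.add_emitter(new Scheduler);
      std::vector<ff_node *> w;
      for (int i = 0; i < nworkers; ++i) w.push_back(new Worker);
      farm.add_workers(w);
      farm.add_collector(new Collector);
      return farm.run_and_wait_end();
  }

The template-type option described above corresponds to building the farm
with or without the collector stage.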
------------------------
qt_mandelbrot
------------------------

The "QT Mandelbrot" is an interactive application that computes the
Mandelbrot set. It is part of the Trolltech QT examples and it consists of
two classes: RenderThread (RenderThread.cpp), a QThread subclass that
renders the Mandelbrot set, and MandelbrotWidget (MandelbrotWidget.cpp), a
QWidget subclass that shows the Mandelbrot set on screen and lets the user
zoom and scroll.

While the worker thread is recomputing the fractal to reflect the new zoom
factor position, the main thread simply scales the previously rendered
pixmap to provide immediate feedback. The result does not look as good as
what the worker thread eventually ends up providing, but at least it makes
the application more responsive.

These two class instances are run as QT threads; the example shows how to
use a worker thread to perform heavy computations without blocking the main
thread's event loop. The application is multithreaded (exploiting QT
threads), but the threads are not used to speed up the heavy computation,
which is done entirely within a single thread; rather, they are used to
decouple two or more activities and to enhance responsiveness. Although it
is a simple application, a full-fledged parallelization of the whole
application is not trivial. The two threads synchronise with each other via
QT events; to guarantee responsiveness, the MandelbrotWidget thread may
start, restart, and abort the activity of RenderThread. This kind of
behaviour, as well as the integration with other threading libraries, makes
the porting to frameworks such as TBB and OpenMP non-trivial.

The FastFlow accelerated version simply parallelises RenderThread by using a
farm accelerator on the outer loop that traverses the pixmap of the
Mandelbrot set. The farm accelerator is created once, then run and frozen
each time an interrupt signal is raised by MandelbrotWidget.

------------------------
nqueens
------------------------

The N-queens problem requires placing N queens on an NxN chessboard such
that no queen can attack another. The proposed solution is based on a
heavily optimised sequential C program written by Jeff Somers (the original
version is implemented in the file nq.c) and uses the FastFlow farm skeleton
without the Collector module.

The FastFlow version produces a stream of independent tasks corresponding to
the initial placement of a number of queens on the board (that number
corresponds to the command line argument 'depth'). The placement of the
remaining queens, starting from the initial placement received as a task
from the input stream, is handled by one of the worker threads.

Performance:

Machine: Dual quad-core Xeon E5420 @2.5GHz (8 cores)
^^^^^^^
Command: nq_ff_acc <board-size> 8 4

 board-size   # solutions       seq. time (nq)   FastFlow acc. (nq_ff_acc)   Speedup
 18x18        666,090,624       6:52             1:06                        6.24
 19x19        4,968,057,848     53:28            8:26                        6.34
 20x20        39,029,188,884    7:16:27          1:08:56                     6.52
 21x21        314,666,222,712   ~2.7 days        9:48:28                     6.69

Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8 cores, 16 contexts)
^^^^^^^
Command: nq_ff_acc <board-size> 16 4

 board-size   # solutions       seq. time (nq)   FastFlow acc. (nq_ff_acc)   Speedup
 18x18        666,090,624       5:53             34                          10.4
 19x19        4,968,057,848     44:56            4:23                        10.2
 20x20        39,029,188,884    6:07:21          35:41                       10.3
 21x21        314,666,222,712   ~2.2 days        5:07:19                     10.3

Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
Command: nq_ff_acc <board-size> 24 4

 board-size   # solutions       seq. time (nq)   FastFlow acc. (nq_ff_acc)
 18x18        666,090,624       5:32             18
 19x19        4,968,057,848     43:06            1:13
 20x20        39,029,188,884    5:50:32          17:46
 21x21        314,666,222,712   ---              2:31:58

Command line parameters of the FastFlow version are:

 nq_ff_acc <width of board> [<num-workers> <depth>]

  width of board: problem size
  num-workers   : number of worker threads used in the computation (default: 2)
  depth         : initial number of queens to place (default: 1)

------------------------
cholesky
------------------------

The cholesky application performs the Cholesky decomposition on a stream of
complex hermitian matrices.

Cholesky factorization, or Cholesky decomposition, is a way to decompose a
hermitian (symmetric in the real case), positive-definite matrix into the
product of a lower triangular matrix and its conjugate transpose. If a
complex matrix A is hermitian and positive definite, then A can be
decomposed as A = LL*, where L is a lower triangular matrix with strictly
positive diagonal entries, and L* denotes the conjugate transpose of L.

The Cholesky decomposition is mainly used for the numerical solution of
linear systems Ax = b: we first compute the decomposition A = LL*, then
solve Ly = b for y, and finally solve L*x = y for x.

There are various methods for calculating the Cholesky decomposition; we
have implemented two of them. The first one is the standard version,
explained in detail at http://en.wikipedia.org/wiki/Cholesky_decomposition.
The second one is the block version, which considers the matrix as divided
into sub-matrices: the algorithm proceeds by working on these sub-matrices
instead of on single elements. A thorough explanation of the block variant
of the Cholesky decomposition can be found in "Programming Matrix
Algorithms-by-Blocks for Thread-Level Parallelism" by Gregorio
Quintana-Ortí et al.

We implemented a farm parallelization of both versions. The emitter
distributes the tasks to the workers; each task is represented by a pointer
to an input matrix A and a pointer to a result matrix L. There is no
collector filter: each worker reads its input and saves its result directly
to main memory.

.......to be completed.....
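For reference, the standard (element-wise) variant mentioned above computes
L row by row directly from the identity A = LL*. The sketch below is a
minimal, self-contained version for a complex hermitian positive-definite
matrix stored in row-major order; the function name cholesky_std and the
flat std::vector layout are illustrative and are not taken from the
application's sources.

  #include <complex>
  #include <vector>
  #include <cmath>

  typedef std::complex<double> cplx;

  // Given the n x n hermitian positive-definite matrix A (row-major),
  // fill the lower triangular factor L (upper part left at 0) such that
  // A = L L*, where L* is the conjugate transpose of L.
  void cholesky_std(int n, const std::vector<cplx> &A, std::vector<cplx> &L) {
      L.assign(n * n, cplx(0.0, 0.0));
      for (int i = 0; i < n; ++i) {
          for (int j = 0; j <= i; ++j) {
              cplx sum = A[i * n + j];
              for (int k = 0; k < j; ++k)
                  sum -= L[i * n + k] * std::conj(L[j * n + k]);
              if (i == j)
                  L[i * n + j] = std::sqrt(sum.real());  // diagonal is real and positive
              else
                  L[i * n + j] = sum / L[j * n + j];
          }
      }
  }

In the farm version, a worker would call something like this on the A and L
pointers carried by each task of the input stream.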
------------------------
matmul
------------------------

Simple NxN integer matrix multiplication (C = A x B). The proposed FastFlow
version splits the computation into NxN tasks, one for each element C[i][j].
Better performance could be obtained by splitting the computation into N
tasks, letting each worker thread compute an entire row of the C matrix (see
the sketch after the performance tables below).

Performance for N=1024. Command: ff_matmul <nworkers>

Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8 cores, 16 contexts)
^^^^^^^
Sequential time: 10780.9 (ms)

 <nworkers>   FastFlow time (ms)   OpenMP time (ms) (gcc 4.4.3)
  4            2982.24              2767.48
  8            1495.87              1455.73
 12            1496.01              1492.87
 16            1495.74              1433.31

Machine: 32-core AMD machine (4 CPUs, 8 cores per CPU)
^^^^^^^
Sequential time: 22324.1 (ms)

 <nworkers>   FastFlow time (ms)   OpenMP time (ms) (gcc 4.3.2)
  4           12173.5              15286.4
  8           11116.2              13725.2
 12            9844.62              8607.82
 16            5890.27              7621.7
 20            5518.41              7490.09
 24            5516.41              6420.61
 28            5194.5               6327.92
 32            5161.4               6043.53

Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
Sequential time: 21967.9 (ms)

 <nworkers>   FastFlow time (ms)   OpenMP time (ms) (gcc 4.3.2)
  4           10448.3               8892.13
  8            6111.95              5138.76
 12            4530.39              4336.8
 16            4771.28              4145.57
 20            4621.81              4284.28
 24            5318.03              4346.22
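The coarser decomposition suggested above (one task per row of C instead of
one per element) could be organised as in the following sketch, again
assuming the classic ff_node/ff_farm interface; the matrix initialisation
and the names are illustrative, and this is not the ff_matmul code.

  #include <cstdlib>
  #include <vector>
  #include <ff/farm.hpp>
  using namespace ff;

  const int N = 1024;
  static long A[N][N], B[N][N], C[N][N];   // C = A x B

  // emitter: one task per row of C (N tasks instead of N*N)
  struct RowEmitter: ff_node {
      void *svc(void *) {
          for (long i = 0; i < N; ++i)
              ff_send_out((void *)(i + 1));  // row index carried in the pointer (+1 to avoid NULL)
          return NULL;                       // end of stream
      }
  };

  // worker: computes an entire row of C
  struct RowWorker: ff_node {
      void *svc(void *task) {
          long i = (long)task - 1;
          for (int j = 0; j < N; ++j) {
              long sum = 0;
              for (int k = 0; k < N; ++k) sum += A[i][k] * B[k][j];
              C[i][j] = sum;
          }
          return GO_ON;                      // results written in place, no output
      }
  };

  int main(int argc, char *argv[]) {
      int nworkers = (argc > 1) ? atoi(argv[1]) : 2;
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j) { A[i][j] = i + j; B[i][j] = i - j; }
      ff_farm<> farm;
      farm.add_emitter(new RowEmitter);
      std::vector<ff_node *> w;
      for (int i = 0; i < nworkers; ++i) w.push_back(new RowWorker);
      farm.add_workers(w);                   // no collector module
      return farm.run_and_wait_end();
  }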
------------------------
pbzip2
------------------------

PBZIP2 is a parallel implementation, by Jeff Gilchrist, of the bzip2
block-sorting file compressor; it uses pthreads and achieves near-linear
speedup on SMP machines. You can find PBZIP2 at http://compression.ca/pbzip2/.

We ported the PBZIP2 code to the FastFlow framework; the source code of the
new version can be found in pbzip2_ff.cpp. Our main objective in rewriting
the application was to show how to rewrite a pthread-based task-farm
application using FastFlow's task-farm schema.

A detailed description of the PBZIP2 algorithm can be found in the paper
"Parallel Data Compression with BZIP2" by Jeff Gilchrist, available at
http://gilchrist.ca/jeff/papers/Parallel_BZIP2.pdf

Note that the main focus of the FastFlow porting is not improving the
performance of Gilchrist's PBZIP2, which is already very efficient and
exhibits nearly optimal speedup. The FastFlow porting rather aims to
demonstrate that:

- FastFlow makes it possible to parallelise the problem by exploiting
  high-level parallel patterns instead of a hand-tuned design, with almost
  no performance penalty and with a productivity gain.

- FastFlow's non-blocking synchronizations exhibit reasonable performance
  compared with traditional blocking synchronizations even in worst-case
  scenarios, such as coarse-grained computations.

Regarding this latter point, imagine a farm schema where you need the
collector filter in order to perform some post-processing (e.g. writing data
into files, buffering tasks to maintain data ordering, etc.). In this case
the non-blocking collector thread might have nothing to compute for long
periods of time because the worker threads are slow in producing output
tasks, so it would be better to stop the collector thread and let it wait to
be woken up by one of the workers as soon as there are tasks to post-process.
In these situations the non-blocking behaviour might waste CPU cycles,
preventing other threads from doing something useful (obviously this happens
mainly when there are more threads than cores). We proved that, by carefully
using FastFlow mechanisms, you can obtain almost the same linear speedup as
the hand-tuned mutex-based version of the same code, even in the non-optimal
case of a farm schema with a collector filter and coarse-grained computation.

Performance (*)

Commands used:
 pbzip2_ff -k -v -f linux-2.6.32.8.tar          (Compression)
 pbzip2_ff -k -v -f -d linux-2.6.32.8.tar.bz2   (Decompression)

Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8 cores, 16 contexts)
^^^^^^^
                 bzip2 (v.1.0.3)   pbzip2     pbzip2_ff
 Compression     64.25             7.083      7.064
 Decompression   14.67             1.8060     1.8728

Machine: 32-core AMD machine (4 CPUs, 8 cores per CPU)
^^^^^^^
                 bzip2 (v.1.0.5)   pbzip2     pbzip2_ff
 Compression     61.13             3.446733   3.432033
 Decompression   13.02             1.311154   1.8728

Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
                 bzip2 (v.1.0.5)   pbzip2 (v.1.1.1)   pbzip2_ff
 Compression     69.15             4.334679           4.354582
 Decompression   16.92             1.453813           1.931218

(*) We executed 10 runs, removed the min and the max values obtained, and
    then computed the mean.
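To illustrate the "buffering tasks to maintain data ordering" role of the
collector filter discussed above, the sketch below shows a collector node
that writes blocks strictly in input order, buffering out-of-order results
by sequence number. The block type and the names are illustrative; this is
not the pbzip2_ff code.

  #include <cstddef>
  #include <cstdio>
  #include <map>
  #include <ff/farm.hpp>
  using namespace ff;

  // illustrative task type: a compressed block tagged with its input position
  struct block_t { unsigned long seq; char *data; size_t len; };

  // collector filter: writes blocks strictly in input order, buffering the
  // out-of-order ones produced by the workers
  struct OrderingCollector: ff_node {
      OrderingCollector(FILE *out): out(out), next(0) {}
      void *svc(void *task) {
          block_t *b = (block_t *)task;
          pending[b->seq] = b;
          // flush every block whose predecessors have already been written
          std::map<unsigned long, block_t *>::iterator it;
          while ((it = pending.find(next)) != pending.end()) {
              block_t *ready = it->second;
              fwrite(ready->data, 1, ready->len, out);
              pending.erase(it);
              delete [] ready->data; delete ready;
              ++next;
          }
          return GO_ON;
      }
      FILE *out;
      unsigned long next;                        // sequence number expected next
      std::map<unsigned long, block_t *> pending;
  };

With a blocking run-time, such a collector could sleep while pending is
empty; a non-blocking collector instead keeps polling, which is exactly the
worst-case scenario discussed above.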