GUPS distribution - 13 Oct 2006

This directory contains several implementations of an algorithm that
can be used to run the HPCC RandomAccess (GUPS) benchmark.

The algorithm is described on this WWW page:
www.cs.sandia.gov/~sjplimp/algorithms/html#gups

The tar file of codes can be downloaded from this WWW page:
www.cs.sandia.gov/~sjplimp/download.html

These codes are distributed by Steve Plimpton
of Sandia National Laboratories:
sjplimp@sandia.gov, www.cs.sandia.gov/~sjplimp

--------------------------------------------------------------------------

This directory should contain the following files:

gups_vanilla.c              vanilla power-of-2 version of algorithm
gups_nonpow2.c              non-power-of-2 version
gups_opt.c                  optimized power-of-2 version

MPIRandomAccess_vanilla.c   implementation of gups_vanilla in HPCC harness
MPIRandomAccess_opt.c       implementation of gups_opt in HPCC harness

Makefile.*                  Makefiles for various machines

--------------------------------------------------------------------------

The gups_* files are stand-alone single-file codes that can be built
using a Makefile like those provided.  E.g.

make -f Makefile.linux gups_vanilla

You will need to create a Makefile.* appropriate to your platform,
that points at the correct MPI library, etc.

Note that these 3 codes support a -DLONG64 C compiler flag.  If a
"long" on your processor is 32-bit (presumably long long is 64 bits),
then don't use -DLONG64; if a "long" is 64 bits, then use -DLONG64.

--------------------------------------------------------------------------

You can run any of the 3 gups* codes as follows:

1 proc:    gups_vanilla N M chunk
P procs:   mpirun -np P gups_vanilla N M chunk

where

N = length of global table is 2^N
M = # of update sets per proc
chunk = # of updates in one set on each proc

Note that 2^N is the length of the global table across all processors.
Thus N = 30 would run with a billion-element table.

Chunk is the number of updates each proc will do before communicating.
In the official HPCC benchmark this is specified to be no larger than
1024, but you can run the code with any value you like.  Your GUPS
performance will typically decrease for smaller chunk sizes.

When each proc performs "chunk" updates, that is one "set" of updates.
M determines how many sets are performed.  The GUPS performance is a
"rate", so it's independent of M, once M is large enough to get good
statistics.  So you can start your testing with a small M to see how
fast your machine runs with this algorithm, then get better stats with
longer runs with a larger M.  An official HPCC benchmark run requires
M be a large number (like the total number of updates = 4x the table
size, if I recall), but your GUPS rate won't change.

After the code runs, it will print out some stats, like this:

> mpirun -np 2 gups_vanilla 20 1000 1024
Number of procs: 2
Vector size: 1048576
Max datums during comm: 1493
Max datums after comm: 1493
Excess datums (frac): 39395 (0.0192358)
Bad locality count: 0
Update time (secs): 0.383
Gups: 0.005351

"Vector size" is the length of the global table.

The "max datums" values tell how message size varied as datums were
routed through the hypercube dimensions.  They should only exceed
"chunk" by a modest amount.  However, the random number generation in
the HPCC algorithm is not very random, so in the first few iterations
a few procs tend to receive larger messages.

The "excess datums" value is the number of updates (and fraction) that
would have been missed if datums greater than the chunk size were
discarded.  It should typically be < 1% for long runs.  The codes do
not discard these excess updates.

The "bad locality" should be 0.  If the code was compiled with -DCHECK
and a non-zero value results, it means some procs are trying to
perform table updates on table indices they don't own, so something is
wrong.

The "update time" is how long the code ran.

"Gups" is the GUPS performance rate, as HPCC defines it.
Namely, the total # of updates per second across all processors,
expressed in billions (giga-updates per second).  The total # of
updates is M*chunk*P, where P = # of processors.  For the sample run
above that is 1000*1024*2 = 2048000 updates.

Once you run on enough processors (e.g. 32), you should see the GUPS
rate nearly double each time you double the number of procs, unless
communication on your machine is slowing things down.

--------------------------------------------------------------------------

The MPIRandomAccess*.c codes are versions of the same algorithms
implemented within the framework that HPCC provides to enable users to
implement new optimized algorithms.

In principle you can take these files and drop them into the HPCC
harness, re-compile the HPCC suite, and run an official HPCC benchmark
test with the new algorithms.

In practice, I don't know the specifics of how to do that!  Courtenay
Vaughan at Sandia was the one who worked on that part of the project.
You can email him if you have questions at ctvaugh@sandia.gov.

You should get essentially the same GUPS number when running these
algorithms in the HPCC harness as you get with the stand-alone codes.

Note that we have only ported the vanilla and opt algorithms (not the
non-power-of-2 version) to the HPCC framework.
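
--------------------------------------------------------------------------

If you want to sanity-check the stats printed by the stand-alone codes,
the arithmetic described earlier (table length 2^N, total updates
M*chunk*P, GUPS as billions of updates per second) can be sketched in
a few lines of C.  This is only an illustration, not code taken from
the gups_* files: the function names below are hypothetical, and the
excess fraction is assumed to be excess datums divided by total
updates, which is what the sample output suggests.

```c
#include <assert.h>

/* Hypothetical helpers, not part of the gups_* codes. */

/* Length of the global table across all procs: 2^N (needs 0 <= N < 64). */
unsigned long long table_length(int n)
{
  assert(n >= 0 && n < 64);
  return 1ULL << n;
}

/* Total updates = M sets per proc * chunk updates per set * P procs. */
unsigned long long total_updates(unsigned long long m,
                                 unsigned long long chunk,
                                 unsigned long long nprocs)
{
  return m * chunk * nprocs;
}

/* GUPS rate: billions of updates per second across all procs. */
double gups_rate(unsigned long long updates, double seconds)
{
  assert(seconds > 0.0);
  return (double) updates / seconds / 1.0e9;
}

/* Assumed definition: excess datums as a fraction of all updates. */
double excess_fraction(unsigned long long excess,
                       unsigned long long updates)
{
  assert(updates > 0);
  return (double) excess / (double) updates;
}
```

For the sample run shown earlier (N = 20, M = 1000, chunk = 1024,
P = 2), table_length(20) is 1048576 and total_updates(1000, 1024, 2)
is 2048000.  gups_rate(2048000, 0.383) comes out at about 0.00535,
in line with the printed 0.005351 (the printed time is rounded), and
excess_fraction(39395, 2048000) is about 0.0192358.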