memperf: A C repository from craft-zhang

/*
 * Memory System Performance Characterisation
 * ECT memperf - Extended Copy Transfer Characterization
 *
 * Thomas M. Stricker <tomstr@inf.ethz.ch>
 * Christian Kurmann  <kurmann@inf.ethz.ch>
 * http://www.cs.inf.ethz.ch/CoPs/ECT/
 *
 * Changes in Changelog
 */

memperf measures memory bandwidth along two dimensions.
First, it varies the block size, which shows the throughput at the
different levels of the memory hierarchy (the different cache
levels). Second, it varies the access pattern from contiguous
blocks to different strided accesses.

Four different tests are provided:

load sum (test -m 0):
The load sum test measures the memory load performance for all
block sizes and access patterns. It accumulates the loaded values so
that the optimizing compiler cannot eliminate the loads, which are the
interesting part of the test.
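
As a rough illustration of the idea (not the code in lcpy.c), a strided
load-sum kernel could look like the sketch below. The function name
load_sum, its parameters, and the choice to visit every element exactly
once are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical sketch of a strided load-sum kernel; not taken from lcpy.c.
 * Reads all n doubles of buf, stepping through the block with the given
 * stride so that each element is loaded exactly once.
 * Returning the sum gives the loads an observable result, so an
 * optimizing compiler cannot simply drop them. */
double load_sum(const double *buf, size_t n, size_t stride)
{
    double sum = 0.0;
    if (stride == 0)
        return sum;
    for (size_t start = 0; start < stride && start < n; start++)
        for (size_t i = start; i < n; i += stride)
            sum += buf[i];
    return sum;
}
```

With stride 1 this is a plain contiguous pass over the block; larger
strides visit the same elements in a different order.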

const store (test -m 1):
The const store test is the reverse of the load sum test.
It measures the store bandwidth for all block sizes and access
patterns.
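
A store-only kernel in the same spirit might be sketched as follows.
The name const_store and the value parameter are hypothetical; this only
illustrates the access pattern and is not the benchmark's own code.

```c
#include <stddef.h>

/* Hypothetical sketch of a strided constant-store kernel; not taken from
 * lcpy.c. Writes the same value to all n doubles of buf, stepping through
 * the block with the given stride so each element is stored exactly once. */
void const_store(double *buf, size_t n, size_t stride, double value)
{
    if (stride == 0)
        return;
    for (size_t start = 0; start < stride && start < n; start++)
        for (size_t i = start; i < n; i += stride)
            buf[i] = value;
}
```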

load copy (test -m 2):
The load copy test performs strided loads and stores the results
contiguously, which simulates a matrix transpose. It is performed
for all block sizes and access patterns.
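
The transpose analogy can be made concrete with a small sketch. The
function name load_copy and its signature are assumptions for this
example, not the interface used in lcpy.c.

```c
#include <stddef.h>

/* Hypothetical sketch of a load-copy kernel; not taken from lcpy.c.
 * Loads src with the given stride and stores to dst contiguously.
 * If src holds a row-major matrix with `stride` columns, dst receives
 * its transpose. */
void load_copy(double *dst, const double *src, size_t n, size_t stride)
{
    size_t out = 0;
    if (stride == 0)
        return;
    for (size_t start = 0; start < stride && start < n; start++)
        for (size_t i = start; i < n; i += stride)
            dst[out++] = src[i];
}
```

For example, the 2x3 matrix {0,1,2, 3,4,5} read with stride 3 is written
out as {0,3, 1,4, 2,5}, which is the 3x2 transpose.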

copy store (test -m 3):
The copy store test is the opposite of the load copy test. It
performs contiguous loads and stores the data in strides, so the
result of the operation is the same as in the load copy test. Again,
all block sizes and access patterns are tested.
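
A matching sketch with the strided access on the store side is shown
below. The name copy_store and its signature are again hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch of a copy-store kernel; not taken from lcpy.c.
 * Loads src contiguously and stores to dst with the given stride.
 * If dst is viewed as a row-major matrix with `stride` columns, it ends
 * up holding the transpose of the matrix stored in src. */
void copy_store(double *dst, const double *src, size_t n, size_t stride)
{
    size_t in = 0;
    if (stride == 0)
        return;
    for (size_t start = 0; start < stride && start < n; start++)
        for (size_t i = start; i < n; i += stride)
            dst[i] = src[in++];
}
```

For the 2x3 matrix {0,1,2, 3,4,5}, storing with stride 2 again yields
{0,3, 1,4, 2,5}: the same transposed data, but with the strided access
moved from the loads to the stores.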


Usage: memperf -m <mode> [-p] [-s] [-n] [-r] [-i] [-t] [-a] [-c] [-o]
       -m <mode>     : 0 = load sum test
                       1 = const store test
                       2 = load copy test
                       3 = copy store test
                       9 = all of the above tests

       -p <nproc>    : Number of processes (Default: 1 process)
                       (numbers higher than processors in the system
                       make no sense and will give strange results)

       -s <mxstrds>  : Number of strides tested
                       (Default: 22 different strides)

       -n <mxsize>   : Maximum block size tested [2^x double values]
                       (Default: 20 = 8MB)

       -r <minsize>  : Minimum block size tested [2^x double values]
                       (Default: 6 = 512 Bytes)

       -i <mxiters>  : Number of iterations for each test (Default: 16)
                       (the number of iterations is chosen adaptively for the
                       examined block size, so this value does not apply to
                       very small and very large blocks)

       -t <tics/us>  : When using the high resolution clock counter the
        (unix only)    program tries to autodetect the clock frequency.
                       This should work on Linux/x86 and Linux/Alpha systems.
                       On other systems the autodetection might not be
                       reliable, especially on MP systems, so you can
                       override it with this option.

       -a <useoptasm>: 0 = don't use optimized functions/special instructions
        (unix only)    1 = use only optimized functions (slower in some cases)
                       2 = both methods (Default: 0)
                       (currently only possible on x86 systems; needs a CPU
                       with SSE or Enhanced 3DNow! support)

       -c <nrofrep>  : Number of repetitions of each test (Default: 3)
                       (to increase the reliability of the results, you
                       shouldn't use 1, especially not on uniprocessor
                       systems; of course, the higher the number, the longer
                       the benchmark takes to complete)

       -o <chartrev> : reverse chart output (to make import into certain
                       programs easier)
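
The byte sizes quoted for the -n and -r defaults follow from the
exponent notation: a block of 2^x double values occupies 2^x times the
size of a double. The helper below is a hypothetical sketch of that
arithmetic and assumes 8-byte doubles, as on common platforms.

```c
#include <stddef.h>

/* Hypothetical helper: size in bytes of a block of 2^exp double values.
 * Assumes exp is smaller than the bit width of size_t. */
size_t block_bytes(unsigned exp)
{
    return ((size_t)1 << exp) * sizeof(double);
}
```

With 8-byte doubles, block_bytes(6) is 512 bytes and block_bytes(20) is
8 MB, matching the defaults listed above.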



Results:

The maximum results of the test are stored in files (one file for each mode).
The naming convention of the files is as follows:
chart.m0.p2.max      this is the maximum result of a mode 0 test with two
                     processes.

If you want the individual results of each repetition of the benchmark,
you need to change the #define chart in lcpy.c; otherwise only the max
files will be generated.
These individual results are stored in files (one file for each process,
each repetition and each mode).
The naming convention of these files is as follows:
chart.m0.p2.out.r3.2 this is the result of the second process of the third
                     repetition of a mode 0 test with two processes.
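
For scripts that post-process the output, the two naming patterns above
can be reproduced with snprintf. The function names below are
hypothetical, and the 1-based repetition and process numbers are inferred
from the example file name rather than from the source code.

```c
#include <stdio.h>

/* Hypothetical helper: name of the max-result file for a mode and
 * process count, e.g. "chart.m0.p2.max". Returns snprintf's result. */
int max_chart_name(char *out, size_t cap, int mode, int nproc)
{
    return snprintf(out, cap, "chart.m%d.p%d.max", mode, nproc);
}

/* Hypothetical helper: name of the per-repetition, per-process file,
 * e.g. "chart.m0.p2.out.r3.2" for repetition 3, process 2. */
int rep_chart_name(char *out, size_t cap, int mode, int nproc,
                   int rep, int proc)
{
    return snprintf(out, cap, "chart.m%d.p%d.out.r%d.%d",
                    mode, nproc, rep, proc);
}
```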

All files have the following format (columns are 8 characters wide):

Load Sum    0.5 K     1 K     2 K     4 K
       1   327.68  402.06  431.16  449.65
       2   321.25  368.18  412.18  439.84
       3   280.49  344.98  388.57  425.28
       4   309.13  339.56  375.56  417.43
       5   287.10  316.83  350.97  406.10

The first column gives the stride, the first row the block size.
All values are MB/s.
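
If you want to read these files from your own tools, one data row can be
parsed with strtol and strtod, as in the sketch below. The function
parse_chart_row is hypothetical and relies only on whitespace between
fields, not on the exact column width.

```c
#include <stdlib.h>

/* Hypothetical helper: parse one data row of a chart file.
 * Stores the leading integer stride in *stride and up to max_vals
 * bandwidth values (MB/s) in vals. Returns the number of values read,
 * or -1 if the line does not start with a number (e.g. the header row). */
int parse_chart_row(const char *line, long *stride, double *vals,
                    int max_vals)
{
    char *end;
    long s = strtol(line, &end, 10);
    if (end == line)
        return -1;
    *stride = s;

    int n = 0;
    while (n < max_vals) {
        const char *p = end;
        double v = strtod(p, &end);
        if (end == p)   /* no further number on this line */
            break;
        vals[n++] = v;
    }
    return n;
}
```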

Visualization:
We use DeltaGraph 4.0 from Delta Point to visualize the results.
We therefore provide a DeltaGraph library deltagraph.lbr with the
chart. 
deltagraph.dg4 is an example DeltaGraph file with one chart.
deltagraph.ps is a sample print.
We also provide an Excel Spreadsheet which generates similar charts.
Feel free to modify it.


Papers:
To understand the theory behind the benchmark, see the following
ISCA and HPCA papers:

T. Stricker and T. Gross. Global Address Space, Non-Uniform Bandwidth:
A Memory System Performance Characterization of Parallel Systems.
Reprint from proceedings of HPCA'97, Feb 1-5, 1997, San Antonio, TX.

T. Stricker and T. Gross. Optimizing Memory System Performance for
Communication in Parallel Computers.
Reprint from proceedings of ISCA'95, June 1995.

Both papers are available under: http://www.cs.inf.ethz.ch/cops/ECT