bes_timing: A Shell repository from OPeNDAP™

This README contains information about both the BES timing tests and
the memory profiling tests.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Timing Code

Utilizing a stopwatch/timer class, with a use pattern that guarantees
that the stop method for the timer will be called, we instrumented the
C++ components of Hyrax to better understand where time was being
spent in the server.

When the debugging is activated for the BES with either the keywords
"timing" or "all" the BES will produces timing output that shows the
order of execution and the elapsed times of the calling method.

More about the stop watch class can be found here:

http://docs.opendap.org/index.php/BES_Timing

We have instrumented the BES in the following locations:

- In the BES dispatch code that processes client commands. We know
the total time it takes to handle each XML command in a BES XML
request document sent from the client.

- In the hdf4_handler, hdf5_handler, ncml_module, netcdf_handler,
ugrid_functions, and w10n_handler response production methods. We
know the time taken taken to produce the DDS, DODS, DDS, and DMR
responses.

- Additional timing instrumentation may be added to as we come to
understand more clearly where time is being spent.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Memory profiling

To see how much memory the BES was using for a particular request,
we used valgrind and its 'massif' tool. This provides both detailed and
overview information about the heap allocation patterns made by the
server.

The script 'heap_profiler' can be used to simplify replicating the
same set of options when calling valgrind. We call heap_profler
with a bescmd file as its argument. The result is a 'massif' output
file with the same basename as that bescmd file. Then use ms_print
to look at the data in the *.massif file.

Examples:

The command

./heap_profiler small_aggregations/TRMM_3A11_hdf4_aggregation_small.ncml.dods-bescmd.xml

will run the given bescmd file and write three files:

small_aggregations/TRMM_3A11_hdf4_aggregation_small.ncml.dods-bescmd.xml.massif
small_aggregations/TRMM_3A11_hdf4_aggregation_small.ncml.dods-bescmd.xml.massif.log
small_aggregations/TRMM_3A11_hdf4_aggregation_small.ncml.dods-bescmd.xml.massif.out

Use ms_print to look at the heap; the .log and .out files are the
output of valgrind (which shows file open/close operations with
massif) and the timing information (largely moot since heap profiling
slows the code).

Here's how we looked at the heap profiler data:

ms_print --x=60 small_aggregations/AIRX3_hdf4_aggregation_small_2.dods-bescmd.xml.massif | more

Note that the --x=60 limits the ASCII graph to 60 characters width
which can be pasted into a Google doc without looking too ugly. Piping
through more makes the long profile output more manageable.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Memory tests run on the aggregation files

We looked at the heap profiles for four different aggregations: Two for
the AIRX 'small' aggregation made using 10 files and two for the TRMM_3A11
aggregation. The bescmd files were (all inside bes/timing/
small_aggregations):

File What
_______________________________________________________________________
AIRX3_hdf4_aggregation_small.dods-bescmd.xml 14 vars
AIRX3_hdf4_aggregation_small_2.dods-bescmd.xml 43 vars
TRMM_3A11_hdf4_aggregation_small.ncml.dods-bescmd.xml 1 var
TRMM_3A11_hdf4_aggregation_small_2.ncml.dods-bescmd.xml Whole dataset

For each of these, we ran the tests on a 'baseline' BES that used libdap
3.14.1 and also an 'improved' libdap (3.14.2) that is more efficient in
the way it uses its memory. We did this using two VMs (from the same
snapshot), but building from two versions of libdap will work as well.

To get the 'baseline' version of libdap, get the code tagged 3.14.1;
the improved code is now the master branch.

To see our results, look at: https://docs.google.com/a/opendap.org/document/d/1qOWmuhSWEthW2dCMlRcWzbjQsvobMSO1hJAp3RkoqB0/edit?usp=sharing

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
BES Timing Tests

- Test Commands -

Beneath this directory you will find two subdirectories, 'test' and
four directories named for the four aggregations used in the timing
and profiling tests. In each there is a collection of files containing
BES commands, each one as would be issued by the OLFS to an attached
BES listener process. The bes command files are named according to the
dataset, the subset, and the response type that is embodied in the
request.

- Shell Scripts -

Also in this directory you will find shell scripts that will run the
timing tests and extract a brief summary of the results.

-- mkAggTests --

This shell script runs the 4 different sets of aggregation timing test
commands in the agg_perf_test directory. The timing summaries are
recorded in files named in keeping with each aggregation's command
set. NB: This command might work a bit differently given changes to
timeTest. It's best to read it over before using. The key thing is to
not inadvertently erase any logs you've worked hard to make with
subsequent runs of the timing code.

-- timeTest --

This script takes as its command line a list of file names that are
assumed to contain BES XML requests.

- timeTest assumes that the environment variable 'prefix' has been
set, and will utilize $prefix to locate the BES configuration file
at $prefix/etc/bes/bes.conf

- The BES response will be placed in a file named by adding the suffix
'.result' to the command file name.

- The debug output will be placed in a file named by adding the suffix
'.log' to the command file name.

- timeTest launches besstandalone like this: besstandalone -c
$prefix/etc/bes/bes.conf -d "cerr,timing" -i $command >
$resultFile 2>>$logFile

- There is a shell variable, 'reps', set at the top of the timeTest
script that controls the number of times timeTest will run the
command. Depending on the speed of the test execution you may want
to adjust that value. NB: I added a command line option (-r) so that
this can be set without hacking the script. jhrg

- Also added: two more options (in addition to -r) to control the
debug command passed to besstandalone: -d <value> adds <value> to
the debug symbols and -D which turns off all debugging, including
the 'timing' information. Use this to determine the amount of
overhead the basic timing code adds.

Once timeTest has completed all of the reps, it analyzes the log file
and computes the average time it takes for every DDS, DAS, DODS, DMR,
and DAP response and prints that summary to stdout.

USAGE:

Run the fnoc1*.xml commands 10 times (the default reps) and collect
data from the 'timing' symbol only (the default debug setting):

./timeTest tests/fnoc1*.xml

Run the tests two times:

./timeTest -r 2 tests/fnoc1*.xml: Run them twice

Run the tests two time but collect data from 'timing' and 'ncml':

./timeTest -r 2 -d ncml tests/fnoc1*.xml

-- timingIndent --

The timingIndent script reads from stdin and writes to stdout. It will
look through its input stream for timing log lines and indent them
according to how many timing START vs timing STOP lines it encounters.

USAGE

cat $logFile | ./timingIndent

OPENDAP/bes_timing