/churn

Memory churn application to test limits of OpenJDK garbage collectors

Primary LanguageJava

Building
--------

You build the churn application using mvn package.
However javac can do just fine, and if no maven is around, the top level run.sh will use it.

Running
-------

There are scripts in the bin directory to run it with the various GCs
available in Hotspot/OpenJDK
  rung1.sh
  runcms.sh
  runpar.sh
  runparold.sh
  runshenandoah.sh
  runzgc.sh
  runzgcgen.sh

The scripts dump a suitably labelled gc log and output file showing
the gc trace output and the program output e.g. if you execute
run-g1.sh the fiels will be
  gclog-g1
  outlog-g1

You can also run with the compressed oops optimization enabled using
the following scripts
  rung1-nocoops.sh
  runcms-nocoops.sh
  runpar-nocoops.sh
  runparold-nocoops.sh
  runshenandoah-nocoops.sh
  runzgc-nocoops.sh
  runzgcgen-nocoops.sh

Next to it, there is also top level
  run.sh
Which is wrapping above bin/* by more reasonable defaults, and/or allows you to run them ALL/defaultgc
The top level run.sh can be called as:
  run.sh <somegc> <someseconds>
  run.sh "<gc1 gc2...>" (note the quotes) which then iterates through selectedGCs or set ALL to try all GCs
  eg: run.sh "zgc shenandoah" 36000 which will run 10 hours zgc and then 10 hours of shenandoah. If you JVM do notsupport any of them, it will fail
It accepts:   HEAPSIZE=3g  ITEMS=250  THREADS=2  DURATION=18000 # 5 hours (in seconds)#  COMPUTATIONS=64  BLOCKS=16 OTOOL_garbageCollector and JAVA_HOME env variables
The variables have priority over arguments
The top level run.sh can generate junit-like xml and tapfile at the end, and is compressing all the logs to single archive (they can be huge)
Note, that if more then one gc is part of the  argument/OTOOL_garbageCollector final enumeration, the DURATION applied to each of them. if you use ALL, the DURATION is split among final set (as you never know how much you will actually run)
The top level run.sh is the only runner which can run from custom directory.

Arguments
---------

The arguments to the program (and the run scripts) are optional in
which case the program uses suitable defaults

  -blocks B [default 4] how many data blocks to allocate per work item
  -items I [default 4000] how many thousand local/global work items in the work set
  -threads T [default 8] how many threads to use to do the processing
  -iterations N [default 200] how many times to update the local/gobal work set
  -duration D [default off] how long in seconds should churn run. overwrites -iterations
  -computations C [default 32] how many computes/write operations are
   performed on each allocated object 
  -slices S [default 100] number of allocate/compute operations per timed task
  -yield Y [default -1] yield (Y = 0) or sleep (for Y msecs) at end of slice
  

where B, I, T, N, D, C and S need to be supplied as positive integers
and Y may be 0 or a positive.

Output
------

The program output includes a total time for execution of the test
program and a histogram output showing the distribution of task
execution times for each nominal 'task' run by the program. A task
comprises a certain, configurable number of allocate and compute
operations on a working set of objects.

Operation
---------

The test program repeatedly updates a local (i.e. short lived) work
set by allocating and inserting work items. So, this work set is
continually being updated and most of the items are dropped with a
relatively short life time.

1 in 10 work items are also inserted into the global (long lived) work
set as well as the lcoal set. These items therefore tend to have a
lifetime approximately 10x > than the normal lifetime.

The local and global work sets (eventually) contain I million items
are each divided into T chunks updated independently by each of the T
threads. So, modifying the thread count does not affect the total work
load or data set size.

Each thread loops repeating updates to the work set N times. So, the
total number of allocations and insertions into the local work set
accumulated across all threads is (I * N) million and each thread
creates (I / T) * N items. Alternatively, when D is provided, new
iterations are started, with same rules mentioned, until running
time exceeds D seconds.

Each time a chunk of I / T items is updated there is a 1 in 100 chance
that the thread will purge all its entries from the global work set,
simulating an application phase change.

Items normally hold B 32 byte blocks. 1 in 100 items stores 2 1Kb
blocks and 1 in 200 items stores 1 32kb block. 1 in 1000 items holds
a single 1Mb block.

Items are also linked randomly with an average chain length of ~ 1.3.

Each allocated work item is modified by computing and writing C byte
values to the allocated byte blocks, cycling round to the start of the
block if necessary. So, by increasing C you can vary the allocation to
computation ratio.

The current msec time is measured after every S work item allocate &
compute operations i.e. this parameter groups a fixed number of work
item operations as a standard task whose time is measured repeatedly
over the run. Differences between sucessive timings are recorded as a
per task execution time. These timings will be fairly consistent
*except* where GC pauses thread execution.

By configuring S and C you can ensure that the sample interval is no
greater than a few milliseconds (by dfeault it will probably be much
less than this). Variation in the readings will be displayed in the
output histograms (per thread and accumulated across threads)
providing an indication of how often and how long new gen and old gen
GCs have interrupted execution of tasks.

If you are running with a lot more mutator threads than cores you may
want to simulate yields or pauses during task execution using the
-yields parameter (running with no pauses and a low compute time
merely serves to thrash the memory system). If Y is 0 then a thread
yields at the end of each slice. If Y is > 0 then it sleeps for Y
msecs. Note that yield and sleep times are not included in the next
task's time measurement.