/osu-micro-benchmarks

MPI Microbenchmarks

This code is from: http://mvapich.cse.ohio-state.edu/benchmarks/

See latest updates here: http://mvapich.cse.ohio-state.edu/static/media/mvapich/CHANGES-OMB.txt

This is NOT my code; I am simply using git to track some of my modifications to this BSD-licensed project.

--------------------------

OMB (OSU Micro-Benchmarks)
--------------------------
The OSU Micro-Benchmarks use the GNU build system. Therefore you can simply
use the following steps to build the MPI benchmarks.

Example:
	./configure CC=/path/to/mpicc CXX=/path/to/mpicxx
	make
	make install

CC and CXX can also be set to other wrapper scripts to build the OpenSHMEM or
UPC++ benchmarks.  Based on this setting, configure will detect whether
your library supports MPI-1, MPI-2, MPI-3, OpenSHMEM, and UPC++ to compile the
corresponding benchmarks.  See http://mvapich.cse.ohio-state.edu/benchmarks/ to
download the latest version of this package.

This package also distributes UPC put, get, and collective benchmarks.
These are located in the upc subdirectory and can be compiled by the
following:

        upcc upc/osu_upc_memput.c -o upc/osu_upc_memput
        upcc upc/osu_upc_memget.c -o upc/osu_upc_memget

        upcc upc/osu_upc_all_barrier.c upc/osu_common.c     \
            -o upc/osu_upc_all_barrier
        upcc upc/osu_upc_all_broadcast.c upc/osu_common.c   \
            -o upc/osu_upc_all_broadcast
        upcc upc/osu_upc_all_exchange.c upc/osu_common.c    \
            -o upc/osu_upc_all_exchange
        upcc upc/osu_upc_all_gather_all.c upc/osu_common.c  \
            -o upc/osu_upc_all_gather_all
        upcc upc/osu_upc_all_gather.c upc/osu_common.c      \
            -o upc/osu_upc_all_gather
        upcc upc/osu_upc_all_reduce.c upc/osu_common.c      \
            -o upc/osu_upc_all_reduce
        upcc upc/osu_upc_all_scatter.c upc/osu_common.c     \
            -o upc/osu_upc_all_scatter

The MPI Multiple Bandwidth / Message Rate (osu_mbw_mr), OpenSHMEM Put Message
Rate (osu_oshm_put_mr), and OpenSHMEM Atomics (osu_oshm_atomics) tests are
intended to be used with block assigned ranks.  This means that all processes
on the same machine are assigned ranks sequentially.

Rank	Block	Cyclic
----------------------
0	host1	host1
1	host1	host2
2	host1	host1
3	host1	host2
4	host2	host1
5	host2	host2
6	host2	host1
7	host2	host2

If you're using mpirun_rsh, the ranks are assigned in the order they are seen
in the hostfile or on the command line.  Please see your process manager's
documentation for information on how to control the rank-to-host mapping.
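
The two placements in the table above reduce to simple index arithmetic. The
sketch below is illustrative only and is not part of OMB; the function names
and the zero-based host index (0 for host1, 1 for host2) are assumptions made
for this example, and it assumes every host runs the same number of ranks.

```c
#include <assert.h>

/* Block placement: consecutive ranks fill one host before moving on to the
 * next. procs_per_host is the number of ranks launched on each host. */
int block_host(int rank, int procs_per_host)
{
    assert(rank >= 0 && procs_per_host > 0);
    return rank / procs_per_host;
}

/* Cyclic placement: ranks are dealt out round-robin across the hosts. */
int cyclic_host(int rank, int num_hosts)
{
    assert(rank >= 0 && num_hosts > 0);
    return rank % num_hosts;
}
```

With eight ranks on two hosts, block_host(1, 4) gives 0 (host1) while
cyclic_host(1, 2) gives 1 (host2), which matches the row for rank 1 in the
table.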

Point-to-Point MPI Benchmarks
-----------------------------
osu_latency - Latency Test
    * The latency tests are carried out in a ping-pong fashion. The sender
    * sends a message with a certain data size to the receiver and waits for a
    * reply from the receiver. The receiver receives the message from the sender
    * and sends back a reply with the same data size. Many iterations of this
    * ping-pong test are carried out and average one-way latency numbers are
    * obtained. The blocking versions of the MPI functions (MPI_Send and
    * MPI_Recv) are used in the tests. This test is available here.
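
The averaging step described above can be sketched as follows. This is not the
OMB source; the function name is hypothetical, and the sketch assumes the timer
brackets only the timed (non-warmup) iterations and that the result is wanted
in microseconds.

```c
#include <assert.h>

/* Average one-way latency in microseconds for a ping-pong loop.
 * elapsed_sec covers `iterations` complete round trips, and each round
 * trip consists of two one-way messages of the same size. */
double one_way_latency_us(double elapsed_sec, int iterations)
{
    assert(iterations > 0 && elapsed_sec >= 0.0);
    return elapsed_sec * 1e6 / (2.0 * iterations);
}
```

For example, 1000 round trips that take 2 ms in total correspond to an average
one-way latency of about 1 microsecond.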

osu_latency_mt - Multi-threaded Latency Test
    * The multi-threaded latency test performs a ping-pong test with a single
    * sender process and multiple threads on the receiving process. In this test
    * the sending process sends a message of a given data size to the receiver
    * and waits for a reply from the receiver process. The receiving process has
    * a variable number of receiving threads (set by default to 2), where each
    * thread calls MPI_Recv and upon receiving a message sends back a response
    * of equal size. Many iterations are performed and the average one-way
    * latency numbers are reported. This test is available here.

osu_bw - Bandwidth Test
    * The bandwidth test is carried out by having the sender send a fixed
    * number (equal to the window size) of back-to-back messages to the
    * receiver and then wait for a reply from the receiver. The receiver
    * sends the reply only after receiving all these messages. This process is
    * repeated for several iterations and the bandwidth is calculated based on
    * the elapsed time (from the time the sender sends the first message until
    * the time it receives the reply back from the receiver) and the number of
    * bytes sent by the sender. The objective of this bandwidth test is to
    * determine the maximum sustained data rate that can be achieved at the
    * network level. Thus, the non-blocking versions of the MPI functions
    * (MPI_Isend and MPI_Irecv) are used in the test. This test is available
    * here.
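
The bandwidth calculation described above amounts to dividing the bytes sent
by the elapsed time. The sketch below is an illustration rather than the OMB
implementation; the function name is hypothetical, and reporting in MB/s with
1 MB = 1e6 bytes is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth in MB/s (1 MB = 1e6 bytes, an assumption of this sketch).
 * Each iteration sends window_size messages of msg_bytes bytes, and
 * elapsed_sec is the sender-side time for all iterations. */
double bandwidth_mb_per_sec(size_t msg_bytes, int window_size,
                            int iterations, double elapsed_sec)
{
    assert(window_size > 0 && iterations > 0 && elapsed_sec > 0.0);
    double total_bytes = (double)msg_bytes * window_size * iterations;
    return total_bytes / 1e6 / elapsed_sec;
}
```

For example, 10 iterations of a 64-message window of 1,000,000-byte messages
completed in 0.64 s correspond to roughly 1000 MB/s.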

osu_bibw - Bidirectional Bandwidth Test
    * The bidirectional bandwidth test is similar to the bandwidth test, except
    * that both the nodes involved send out a fixed number of back-to-back
    * messages and wait for the reply. This test measures the maximum
    * sustainable aggregate bandwidth by two nodes. This test is available here.

osu_mbw_mr - Multiple Bandwidth / Message Rate Test
    * The multi-pair bandwidth and message rate test evaluates the aggregate
    * uni-directional bandwidth and message rate between multiple pairs of
    * processes. Each of the sending processes sends a fixed number of messages
    * (the window size) back-to-back to the paired receiving process before
    * waiting for a reply from the receiver. This process is repeated for
    * several iterations. The objective of this benchmark is to determine the
    * achieved bandwidth and message rate from one node to another node with a
    * configurable number of processes running on each node. The test is
    * available here.

osu_multi_lat - Multi-pair Latency Test (requires threading support from MPI-2)
    * This test is very similar to the latency test. However, at the same
    * instant multiple pairs are performing the same test simultaneously.
    * In order to perform the test across just two nodes the hostnames must
    * be specified in block fashion.

Collective MPI Benchmarks
-------------------------
osu_allgather     - MPI_Allgather Latency Test(*)
osu_allgatherv    - MPI_Allgatherv Latency Test
osu_allreduce     - MPI_Allreduce Latency Test
osu_alltoall      - MPI_Alltoall Latency Test
osu_alltoallv     - MPI_Alltoallv Latency Test
osu_barrier       - MPI_Barrier Latency Test
osu_bcast         - MPI_Bcast Latency Test
osu_gather        - MPI_Gather Latency Test(*)
osu_gatherv       - MPI_Gatherv Latency Test
osu_reduce        - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter       - MPI_Scatter Latency Test(*)
osu_scatterv      - MPI_Scatterv Latency Test

Collective Latency Tests
    * The latest OMB version includes benchmarks for various MPI blocking
    * collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce,
    * MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter,
    * MPI_Scatter and vector collectives). These benchmarks work in the
    * following manner.  When users run the osu_bcast benchmark with N
    * processes, the benchmark measures the min, max and the average latency of
    * the MPI_Bcast collective operation across N processes, for various
    * message lengths, over a large number of iterations. In the default
    * version, these benchmarks report the average latency for each message
    * length. Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark,
           such as min and max latencies and the number of iterations.
    * "-m" option can be used to set the minimum and maximum message length
           to be used in a benchmark. In the default version, the benchmarks
           report the latencies for up to 1MB message lengths. Examples:
            -m 128      // min = default, max = 128
            -m 2:128    // min = 2, max = 128
            -m 2:       // min = 2, max = default
    * "-x" can be used to set the number of warmup iterations to skip for each
           message length.
    * "-i" can be used to set the number of iterations to run for each message
           length.
    * "-M" can be used to set per process maximum memory consumption.  By
           default the benchmarks are limited to 512MB allocations.
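
The three "-m" forms listed above can be parsed with a small helper. The sketch
below is not taken from the OMB sources: the function name is hypothetical,
input validation is deliberately minimal, and the caller is assumed to preload
the defaults (the default minimum of 1 used in the usage note is also an
assumption; only the 1MB default maximum comes from the text above).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse "MAX", "MIN:MAX", or "MIN:" into *min_len and *max_len.
 * A bound that is absent from arg keeps the value already stored in the
 * output, so callers preload the defaults. Returns 0 on success and -1
 * if the resulting range is empty. */
int parse_msg_range(const char *arg, size_t *min_len, size_t *max_len)
{
    assert(arg != NULL && min_len != NULL && max_len != NULL);

    const char *colon = strchr(arg, ':');
    if (colon == NULL) {
        /* "-m 128": only the maximum is given. */
        *max_len = (size_t)strtoul(arg, NULL, 10);
    } else {
        if (colon != arg) {
            /* Text before the colon is the minimum; strtoul stops at ':'. */
            *min_len = (size_t)strtoul(arg, NULL, 10);
        }
        if (colon[1] != '\0') {
            /* Text after the colon is the maximum. */
            *max_len = (size_t)strtoul(colon + 1, NULL, 10);
        }
    }
    return (*min_len <= *max_len) ? 0 : -1;
}
```

Starting from min = 1 and max = 1048576, the argument "2:" leaves the maximum
untouched and sets the minimum to 2, while "128" changes only the maximum.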


Support for CUDA Managed Memory
-------------------------------
The following benchmarks have been extended to evaluate performance of MPI communication
from and to buffers allocated using CUDA Managed Memory.

    * osu_bibw - Bidirectional Bandwidth Test
    * osu_bw - Bandwidth Test
    * osu_latency - Latency Test
    * osu_allgather - MPI_Allgather Latency Test
    * osu_allgatherv - MPI_Allgatherv Latency Test
    * osu_allreduce - MPI_Allreduce Latency Test
    * osu_alltoall - MPI_Alltoall Latency Test
    * osu_alltoallv - MPI_Alltoallv Latency Test
    * osu_bcast - MPI_Bcast Latency Test
    * osu_gather - MPI_Gather Latency Test
    * osu_gatherv - MPI_Gatherv Latency Test
    * osu_reduce - MPI_Reduce Latency Test
    * osu_reduce_scatter - MPI_Reduce_scatter Latency Test
    * osu_scatter - MPI_Scatter Latency Test
    * osu_scatterv - MPI_Scatterv Latency Test

In addition to support for communications to and from GPU memories allocated
using CUDA or OpenACC, we now provide additional capability of performing
communications to and from buffers allocated using the CUDA Managed Memory concept.
CUDA Managed (or Unified) Memory allows applications to allocate memory on either CPU
or GPU memories using the cudaMallocManaged() call. This allows the memory buffer to
be transferred between the CPU and the GPU transparently to the user. Currently, we
offer benchmarking with CUDA Managed Memory using the tests mentioned above.

These benchmarks have additional options:
    * "M" allocates a send or receive buffer as managed for point to point communication.
    * "-d managed" uses managed memory buffers to perform collective communications.


Non-Blocking Collective MPI Benchmarks
--------------------------------------
osu_iallgather    - MPI_Iallgather Latency Test
osu_iallgatherv   - MPI_Iallgatherv Latency Test
osu_ialltoall     - MPI_Ialltoall Latency Test
osu_ialltoallv    - MPI_Ialltoallv Latency Test
osu_ialltoallw    - MPI_Ialltoallw Latency Test
osu_ibarrier      - MPI_Ibarrier Latency Test
osu_ibcast        - MPI_Ibcast Latency Test
osu_igather       - MPI_Igather Latency Test
osu_igatherv      - MPI_Igatherv Latency Test
osu_iscatter      - MPI_Iscatter Latency Test
osu_iscatterv     - MPI_Iscatterv Latency Test

Non-Blocking Collective Latency Tests
    * In addition to the blocking collective latency tests, we provide several
    * non-blocking collectives as mentioned above. These evaluate the same
    * metrics as the blocking operations as well as the additional metric
    * `overlap'.  This is defined as the amount of computation that can be
    * performed while the communication progresses in the background.
    * These benchmarks have the additional option:
    * "-t" set the number of MPI_Test() calls during the dummy computation, set
           CALLS to 100, 1000, or any number > 0.
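
One common way to turn the overlap idea into a percentage is sketched below.
The formula, the function name, and the three timing inputs are assumptions
made for illustration; the README does not give the exact expression OMB uses,
so treat this as one plausible definition rather than the benchmark's own.

```c
#include <assert.h>

/* Overlap as a percentage of the pure communication time.
 * t_pure_comm: latency of the non-blocking collective with no computation.
 * t_compute:   time of the dummy computation on its own.
 * t_total:     time when the computation runs between the start of the
 *              collective and its completion. */
double overlap_percent(double t_pure_comm, double t_compute, double t_total)
{
    assert(t_pure_comm > 0.0 && t_compute >= 0.0 && t_total >= 0.0);
    /* Communication time that was not hidden behind the computation. */
    double exposed = t_total - t_compute;
    double overlap = 100.0 * (1.0 - exposed / t_pure_comm);
    if (overlap < 0.0) overlap = 0.0;
    if (overlap > 100.0) overlap = 100.0;
    return overlap;
}
```

Under this definition, if communication alone takes 10 us, computation alone
takes 10 us, and the combined run takes 12 us, about 80% of the communication
was overlapped. A combined run of 20 us or more yields 0%.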


One-sided MPI Benchmarks
------------------------
osu_put_latency - Latency Test for Put with Active/Passive Synchronization
    * The put latency benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the origin process calls MPI_Put to directly place data of a certain size
    * in the remote process's window and then waits on a synchronization call
    * (MPI_Win_complete) for completion.  The remote process participates in
    * synchronization with MPI_Win_post and MPI_Win_wait calls. Several
    * iterations of this test are carried out and the average put latency
    * is reported. The latency includes the synchronization time also.
    * For passive synchronization, suppose users run with MPI_Win_lock/unlock,
    * the origin process calls MPI_Win_lock to lock the target process's window
    * and calls MPI_Put to directly place data of certain size in the window.
    * Then it calls MPI_Win_unlock to ensure completion of the Put and release
    * lock on the window. This is carried out for several iterations and the
    * average time for MPI_Win_lock + MPI_Put + MPI_Win_unlock calls is
    * measured. The default window initialization and synchronization
    * operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers
    * the following options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.
    * "-x"              can be used to set the number of warmup iterations to
                        skip for each message length.
    * "-i"              can be used to set the number of iterations to run for
                        each message length.

osu_get_latency - Latency Test for Get with Active/Passive Synchronization
    * The get latency benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the origin process calls MPI_Get to directly fetch data of a certain size
    * from the target process's window into a local buffer. It then waits on a
    * synchronization call (MPI_Win_complete) for local completion of the Gets.
    * The remote process participates in synchronization with MPI_Win_post and
    * MPI_Win_wait calls. Several iterations of this test are carried out and
    * the average get latency is reported. The latency includes the
    * synchronization time also. For passive synchronization, suppose users run
    * with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock
    * the target process's window and calls MPI_Get to directly read data of
    * certain size from the window. Then it calls MPI_Win_unlock to ensure
    * completion of the Get and release the lock on the remote window. This is
    * carried out for several iterations and the average time for
    * MPI_Win_lock + MPI_Get + MPI_Win_unlock calls is measured.  The default
    * window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization
    * The put bandwidth benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the test is carried out by the origin process calling a fixed number of
    * back-to-back MPI_Puts on remote window and then waiting on a
    * synchronization call (MPI_Win_complete) for their completion. The remote
    * process participates in synchronization with MPI_Win_post and
    * MPI_Win_wait calls. This process is repeated for several iterations and
    * the bandwidth is calculated based on the elapsed time and the number of
    * bytes put by the origin process. For passive synchronization, suppose
    * users run with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock
    * to lock the target process's window and calls a fixed number of
    * back-to-back MPI_Puts to directly place data in the window. Then it calls
    * MPI_Win_unlock to ensure completion of the Puts and release lock on
    * remote window. This process is repeated for several iterations and the
    * bandwidth is calculated based on the elapsed time and the number of bytes
    * put by the origin process. The default window initialization and
    * synchronization operations are MPI_Win_allocate and MPI_Win_flush.  The
    * benchmark offers the following options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization
    * The get bandwidth benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the test is carried out by the origin process calling a fixed number of
    * back-to-back MPI_Gets and then waiting on a synchronization call
    * (MPI_Win_complete) for their completion. The remote process participates
    * in synchronization with MPI_Win_post and MPI_Win_wait calls. This process
    * is repeated for several iterations and the bandwidth is calculated based
    * on the elapsed time and the number of bytes received by the origin
    * process. For passive synchronization, suppose users run with
    * MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
    * target process's window and calls a fixed number of back-to-back MPI_Gets
    * to directly get data from the window. Then it calls MPI_Win_unlock to
    * ensure completion of the Gets and release lock on the window. This
    * process is repeated for several iterations and the bandwidth is
    * calculated based on the elapsed time and the number of bytes read by the
    * origin process.  The default window initialization and synchronization
    * operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers
    * the following options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_put_bibw - Bi-directional Bandwidth Test for Put with Active
               Synchronization
    * The put bi-directional bandwidth benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_Post/Start/Complete/Wait and
    * MPI_Win_fence).  This test is similar to the bandwidth test, except that
    * both the processes involved send out a fixed number of back-to-back
    * MPI_Puts and wait for their completion. This test measures the maximum
    * sustainable aggregate bandwidth by two processes. The default window
    * initialization and synchronization operations are MPI_Win_allocate and
    * MPI_Win_Post/Start/Complete/Wait. The benchmark offers the following
    * options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_acc_latency - Latency Test for Accumulate with Active/Passive
                  Synchronization
    * The accumulate latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the origin process calls MPI_Accumulate to combine data from the local
    * buffer with the data in the remote window and store it in the remote
    * window. The combining operation used in the test is MPI_SUM. The origin
    * process then waits on a synchronization call (MPI_Win_complete) for
    * completion of the operations. The remote process waits on a MPI_Win_wait
    * call. Several iterations of this test are carried out and the average
    * accumulate latency number is obtained. The latency includes the
    * synchronization time also.  For passive synchronization, suppose users
    * run with MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to
    * lock the target process's window and calls MPI_Accumulate to combine data
    * from a local buffer with the data in the remote window and store it in
    * the remote window.  Then it calls MPI_Win_unlock to ensure completion of
    * the Accumulate and release lock on the window. This is carried out for
    * several iterations and the average time for MPI_Win_lock +
    * MPI_Accumulate + MPI_Win_unlock calls is measured. The default window
    * initialization and synchronization operations are MPI_Win_allocate and
    * MPI_Win_flush. The benchmark offers the following options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_cas_latency - Latency Test for Compare and Swap with Active/Passive
                  Synchronization
    * The Compare_and_swap latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with
    * MPI_Win_Post/Start/Complete/Wait, the origin process calls
    * MPI_Compare_and_swap to place one element from the origin buffer into the
    * target buffer.  The initial value in the target buffer is returned to the
    * calling process. The origin process then waits on a synchronization call
    * (MPI_Win_complete) for local completion of the operations. The remote
    * process waits on a MPI_Win_wait call. Several iterations of this test are
    * carried out and the average Compare_and_swap latency number is obtained.
    * The latency includes the synchronization time also.  For passive
    * synchronization, suppose users run with MPI_Win_lock/unlock, the origin
    * process calls MPI_Win_lock to lock the target process's window and calls
    * MPI_Compare_and_swap to place one element from the origin buffer into the
    * target buffer. The initial value in the target buffer is returned to the
    * calling process. Then it calls MPI_Win_flush to ensure completion of the
    * Compare_and_swap. In the end, it calls MPI_Win_unlock to release lock on
    * the window. This is carried out for several iterations and the average
    * time for MPI_Compare_and_swap + MPI_Win_flush calls is measured. The
    * default window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_fop_latency - Latency Test for Fetch and Op with Active/Passive
                  Synchronization
    * The Fetch_and_op latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the origin process calls MPI_Fetch_and_op to increase the element in
    * target buffer by 1. The initial value from the target buffer is returned
    * to the calling process. The origin process waits on a synchronization
    * call (MPI_Win_complete) for completion of the operations. The remote
    * process waits on a MPI_Win_wait call. Several iterations of this test are
    * carried out and the average Fetch_and_op latency number is obtained. The
    * latency includes the synchronization time also.  For passive
    * synchronization, suppose users run with MPI_Win_lock/unlock, the origin
    * process calls MPI_Win_lock to lock the target process's window and calls
    * MPI_Fetch_and_op to increase the element in the target buffer by 1. The
    * initial value in the target buffer is returned to the calling process.
    * Then it calls MPI_Win_flush to ensure completion of the Fetch_and_op. In
    * the end, it calls MPI_Win_unlock to release lock on the window. This is
    * carried out for several iterations and the average time for
    * MPI_Fetch_and_op + MPI_Win_flush calls is measured. The
    * default window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.

osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive
                      Synchronization
    * The Get_accumulate latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_Post/Start/Complete/Wait and MPI_Win_fence). For active
    * synchronization, suppose users run with MPI_Win_Post/Start/Complete/Wait,
    * the origin process calls MPI_Get_accumulate to combine data from the
    * local buffer with the data in the remote window and store it in the
    * remote window. The combining operation used in the test is MPI_SUM. The
    * initial value from the target buffer is returned to the calling process.
    * The origin process waits on a synchronization call (MPI_Win_complete) for
    * local completion of the operations. The remote process waits on a
    * MPI_Win_wait call. Several iterations of this test are carried out and
    * the average get accumulate latency number is obtained. The latency
    * includes the synchronization time also.  For passive synchronization,
    * suppose users run with MPI_Win_lock/unlock, the origin process calls
    * MPI_Win_lock to lock the target process's window and calls
    * MPI_Get_accumulate to combine data from a local buffer with the data in
    * the remote window and store it in the remote window.  The initial value
    * from the target buffer is returned to the calling process.  Then it calls
    * MPI_Win_unlock to ensure completion of the Get_accumulate and release
    * lock on the window. This is carried out for several iterations and the
    * average time for MPI_Win_lock + MPI_Get_accumulate + MPI_Win_unlock calls
    * is measured. The default window initialization and synchronization
    * operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers
    * the following options:
    * "-w create"       use MPI_Win_create to create an MPI Window object.
    * "-w allocate"     use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"      use MPI_Win_create_dynamic to create an MPI Window
    *                   object.
    * "-s lock"         use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"        use MPI_Win_flush synchronization call.
    * "-s flush_local"  use MPI_Win_flush_local synchronization call.
    * "-s lock_all"     use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"         use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"        use MPI_Win_fence synchronization call.
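
The data movement of MPI_Get_accumulate with MPI_SUM described above can be
modeled in plain C. This is only a single-process sketch: it makes no MPI
calls, and the function name model_get_accumulate_sum is hypothetical. It
shows that the result buffer receives the target's value from before the
update, while the target ends up holding the sum.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Single-process model of MPI_Get_accumulate with MPI_SUM on int data.
 * NOT an MPI call: "target" stands in for the remote window memory.
 */
void model_get_accumulate_sum(const int *origin, int *result,
                              int *target, size_t count)
{
    assert(origin != NULL && result != NULL && target != NULL);

    for (size_t i = 0; i < count; i++) {
        result[i] = target[i];   /* return the value before the update */
        target[i] += origin[i];  /* combine with MPI_SUM, store in target */
    }
}
```

In the real benchmark the same effect is produced by one MPI_Get_accumulate
call inside the selected synchronization epoch.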

Point-to-Point OpenSHMEM Benchmarks
-----------------------------------
osu_oshm_put.c - Latency Test for OpenSHMEM Put Routine
    * This benchmark measures the latency of a shmem_putmem operation for
    * different data sizes. The user is required to select, through a
    * parameter, whether the communication buffers should be allocated in
    * global memory or heap memory. The test requires exactly two PEs. PE 0
    * issues shmem_putmem to write data at PE 1 and then calls shmem_quiet.
    * This is repeated for a fixed number of iterations, depending on the data
    * size. The average latency per iteration is reported. A few warm-up
    * iterations are run without timing to exclude any start-up overheads.
    * Both PEs call shmem_barrier_all after the test for each message size.
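
The warm-up and averaging scheme described above can be sketched in plain C.
This is an illustration, not the benchmark's code: average_latency_us,
now_seconds and count_call are hypothetical names, and the callback stands in
for the put followed by the quiet call. It relies only on C11 timespec_get.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Calls op(ctx) skip + iterations times. The first "skip" calls are
 * warm-up and are not timed. Returns the average time per timed call
 * in microseconds.
 */
double average_latency_us(void (*op)(void *), void *ctx,
                          int skip, int iterations)
{
    assert(op != NULL && skip >= 0 && iterations > 0);

    double start = 0.0;
    for (int i = 0; i < skip + iterations; i++) {
        if (i == skip)
            start = now_seconds();  /* start the clock after warm-up */
        op(ctx);
    }
    double end = now_seconds();

    return (end - start) * 1e6 / iterations;
}

/* Stand-in operation for illustration: counts how often it was called. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

For example, average_latency_us(count_call, &calls, 10, 100) runs the
operation 110 times and times only the last 100.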

osu_oshm_get.c - Latency Test for OpenSHMEM Get Routine
    * This benchmark is similar to the one above except that PE 0 does a
    * shmem_getmem operation to read data from PE 1 in each iteration. The
    * average latency per iteration is reported.

osu_oshm_put_mr.c - Message Rate Test for OpenSHMEM Put Routine
    * This benchmark measures the aggregate uni-directional operation rate of
    * OpenSHMEM Put between pairs of PEs, for different data sizes. The user
    * should select whether the communication buffers are in global memory or
    * heap memory, as with the earlier benchmarks. This test requires the
    * number of PEs to be even. The PEs are paired, with PE 0 pairing with PE
    * n/2 and so on, where n is the total number of PEs. The first PE in each
    * pair issues back-to-back shmem_putmem operations to its peer PE. The
    * total time for the put operations is measured and the operation rate per
    * second is reported. All PEs call shmem_barrier_all after the test for
    * each message size.
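
The pairing rule and the rate calculation can be sketched as follows. This
assumes that "and so on" means PE i pairs with PE i + n/2 for every i below
n/2. The helper names pair_peer, is_sender and operation_rate are
hypothetical and do not come from the benchmark sources.

```c
#include <assert.h>

/* Peer of "pe" when PE i (i < n/2) is paired with PE i + n/2. */
int pair_peer(int pe, int npes)
{
    assert(npes > 0 && npes % 2 == 0);  /* the test needs an even PE count */
    assert(pe >= 0 && pe < npes);

    int half = npes / 2;
    return pe < half ? pe + half : pe - half;
}

/* Nonzero when "pe" is the first PE of its pair, which issues the puts. */
int is_sender(int pe, int npes)
{
    return pe < npes / 2;
}

/* Operations per second for one pair; the aggregate sums over all pairs. */
double operation_rate(long operations, double seconds)
{
    assert(seconds > 0.0);
    return (double)operations / seconds;
}
```

With 8 PEs, PE 0 sends to PE 4 and PE 3 sends to PE 7, while PEs 4 to 7 only
receive.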

osu_oshm_atomics.c - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
    * This benchmark measures the performance of atomic fetch-and-operate and
    * atomic operate routines supported in OpenSHMEM for the integer
    * datatype. The buffers can be selected to be in heap memory or global
    * memory. The PEs are paired as in the Put Operation Rate benchmark, and
    * the first PE in each pair issues back-to-back atomic operations of one
    * type to its peer PE. The average latency per atomic
    * operation and the aggregate operation rate are reported.  This is
    * repeated for each of fadd, finc, add, inc, cswap and swap routines.

Collective OpenSHMEM Benchmarks
-------------------------------
osu_oshm_collect   - OpenSHMEM Collect Latency Test
osu_oshm_fcollect  - OpenSHMEM FCollect Latency Test
osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
osu_oshm_reduce    - OpenSHMEM Reduce Latency Test
osu_oshm_barrier   - OpenSHMEM Barrier Latency Test

Collective Latency Tests
    * The latest OMB Version includes benchmarks for various OpenSHMEM
    * collective operations (shmem_collect, shmem_broadcast, shmem_reduce and
    * shmem_barrier). These benchmarks work in the following manner. When
    * users run the osu_oshm_broadcast benchmark with N processes, the
    * benchmark measures the min, max and the average latency of the
    * shmem_broadcast collective operation across N processes, for various
    * message lengths, over a large number of iterations. In the default
    * version, these benchmarks report the average latency for each message
    * length. Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark,
           such as min and max latencies and the number of iterations.
    * "-m" option can be used to set the maximum message length to be used in a
           benchmark. In the default version, the benchmarks report the
           latencies for up to 1MB message lengths.
    * "-i" can be used to set the number of iterations to run for each message
           length.
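
The min, max and average statistics reported with "-f" can be sketched as a
reduction over one latency value per process. This is a plain C illustration
with hypothetical names (latency_stats, summarize_latencies). In the real
benchmarks the per-process values would have to be combined across PEs by the
communication library rather than read from a local array.

```c
#include <assert.h>
#include <stddef.h>

struct latency_stats {
    double min;
    double max;
    double avg;
};

/* Summarizes n per-process latencies (for example, in microseconds). */
struct latency_stats summarize_latencies(const double *latency_us, size_t n)
{
    assert(latency_us != NULL && n > 0);

    struct latency_stats s = { latency_us[0], latency_us[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (latency_us[i] < s.min)
            s.min = latency_us[i];
        if (latency_us[i] > s.max)
            s.max = latency_us[i];
        sum += latency_us[i];
    }
    s.avg = sum / (double)n;

    return s;
}
```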

Point-to-Point UPC Benchmarks
-----------------------------
osu_upc_memput.c - Put Latency
    * This benchmark measures the latency of the UPC put operation between
    * multiple UPC threads. In this benchmark, UPC threads with ranks less
    * than (THREADS/2) issue upc_memput operations to peer UPC threads. Peer
    * threads are identified as (MYTHREAD+THREADS/2). This is repeated for a
    * fixed number of iterations, for varying data sizes. The average latency
    * per iteration is reported. A few warm-up iterations are run without
    * timing to exclude any start-up overheads. All UPC threads call
    * upc_barrier after the test for each message size.

osu_upc_memget.c - Get Latency
    * This benchmark is similar to the osu_upc_memput benchmark described
    * above. The difference is that the shared string handling function is
    * upc_memget. The average get operation latency per iteration is reported.

Collective UPC Benchmarks
-------------------------
osu_upc_all_barrier     - UPC Barrier Latency Test
osu_upc_all_broadcast   - UPC Broadcast Latency Test
osu_upc_all_scatter     - UPC Scatter Latency Test
osu_upc_all_gather      - UPC Gather Latency Test
osu_upc_all_gather_all  - UPC GatherAll Latency Test
osu_upc_all_reduce      - UPC Reduce Latency Test
osu_upc_all_exchange    - UPC Exchange Latency Test

Collective Latency Tests
    * The latest OMB Version includes benchmarks for various UPC collective
    * operations (upc_all_barrier, upc_all_broadcast, upc_all_scatter,
    * upc_all_gather, upc_all_gather_all, upc_all_reduce, and
    * upc_all_exchange). These benchmarks work in the following manner. When
    * users run the osu_upc_all_broadcast benchmark with N processes, the
    * benchmark measures the min, max and the average latency of the
    * upc_all_broadcast collective operation across N processes, for various
    * message lengths, over a large number of iterations. In the default
    * version, these benchmarks report the average latency for each message
    * length. Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark, such
    * as min and max latencies and the number of iterations.
    * "-m" option can be used to set the maximum message length to be used in a
    * benchmark. In the default version, the benchmarks report the latencies
    * for up to 1MB message lengths.
    * "-i" can be used to set the number of iterations to run for each message
    * length.

Point-to-Point UPC++ Benchmarks
-------------------------------
osu_upcxx_async_copy_put.c - Put Latency
    * This benchmark measures the latency of the UPC++ async_copy operation
    * between multiple UPC++ threads. In this benchmark, UPC++ threads with
    * ranks less than (THREADS/2) issue UPC++ async_copy from local to remote
    * memory on peer threads. Peer threads are identified as
    * (MYTHREAD+THREADS/2). This is repeated for a fixed number of iterations,
    * for varying data sizes. The average latency per iteration is reported. A
    * few warm-up iterations are run without timing to ignore any start-up
    * overheads. All UPC++ threads call barrier after the test for each message
    * size.

osu_upcxx_async_copy_get.c - Get Latency
    * This benchmark is similar to the osu_upcxx_async_copy_put benchmark
    * described above. The difference is that the async_copy operation
    * copies from remote to local memory. The average get operation latency per
    * iteration is reported.

Collective UPC++ Benchmarks
---------------------------
osu_upcxx_allgather - UPC++ Allgather Latency Test
osu_upcxx_alltoall  - UPC++ Alltoall Latency Test
osu_upcxx_bcast     - UPC++ Broadcast Latency Test
osu_upcxx_gather    - UPC++ Gather Latency Test
osu_upcxx_reduce    - UPC++ Reduce Latency Test
osu_upcxx_scatter   - UPC++ Scatter Latency Test

Collective Latency Tests
    * The latest OMB Version includes benchmarks for various UPC++ collective
    * operations (upcxx_allgather, upcxx_alltoall, upcxx_bcast, upcxx_gather,
    * upcxx_reduce, and upcxx_scatter).  These benchmarks work in the following
    * manner. When users run the osu_upcxx_bcast benchmark with N processes,
    * the benchmark measures the min, max and the average latency of the
    * upcxx_bcast collective operation across N processes, for various message
    * lengths, over a large number of iterations. In the default version, these
    * benchmarks report the average latency for each message length.
    * Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark, such
    * as min and max latencies and the number of iterations.
    * "-m" option can be used to set the maximum message length to be used in a
    * benchmark. In the default version, the benchmarks report the latencies
    * for up to 1MB message lengths.
    * "-i" can be used to set the number of iterations to run for each message
    * length.

CUDA and OpenACC Extensions to OMB
----------------------------------
CUDA Extensions to OMB can be enabled by configuring the benchmark suite with
the --enable-cuda option as shown below.  Similarly, OpenACC Extensions can be
enabled by specifying the --enable-openacc option.  The MPI library used should
be able to support MPI communication from buffers in GPU Device memory.

    ./configure CC=/path/to/mpicc \
                CXX=/path/to/mpicxx \
                --enable-cuda \
                --with-cuda-include=/path/to/cuda/include \
                --with-cuda-libpath=/path/to/cuda/lib
    make
    make install

The following benchmarks have been extended to evaluate performance of
MPI communication using buffers on NVIDIA GPU devices.

    osu_bibw          - Bidirectional Bandwidth Test
    osu_bw            - Bandwidth Test
    osu_latency       - Latency Test
    osu_put_latency   - Latency Test for Put
    osu_get_latency   - Latency Test for Get
    osu_put_bw        - Bandwidth Test for Put
    osu_get_bw        - Bandwidth Test for Get
    osu_put_bibw      - Bidirectional Bandwidth Test for Put
    osu_acc_latency   - Latency Test for Accumulate
    osu_cas_latency   - Latency Test for Compare and Swap
    osu_fop_latency   - Latency Test for Fetch and Op
    osu_allgather     - MPI_Allgather Latency Test
    osu_allgatherv    - MPI_Allgatherv Latency Test
    osu_allreduce     - MPI_Allreduce Latency Test
    osu_alltoall      - MPI_Alltoall Latency Test
    osu_alltoallv     - MPI_Alltoallv Latency Test
    osu_bcast         - MPI_Bcast Latency Test
    osu_gather        - MPI_Gather Latency Test
    osu_gatherv       - MPI_Gatherv Latency Test
    osu_reduce        - MPI_Reduce Latency Test
    osu_reduce_scatter - MPI_Reduce_scatter Latency Test
    osu_scatter       - MPI_Scatter Latency Test
    osu_scatterv      - MPI_Scatterv Latency Test
    osu_iallgather    - MPI_Iallgather Latency Test
    osu_iallgatherv   - MPI_Iallgatherv Latency Test
    osu_ialltoall     - MPI_Ialltoall Latency Test
    osu_ialltoallv    - MPI_Ialltoallv Latency Test
    osu_ialltoallw    - MPI_Ialltoallw Latency Test
    osu_ibcast        - MPI_Ibcast Latency Test
    osu_igather       - MPI_Igather Latency Test
    osu_igatherv      - MPI_Igatherv Latency Test
    osu_iscatter      - MPI_Iscatter Latency Test
    osu_iscatterv     - MPI_Iscatterv Latency Test

If both CUDA and OpenACC support are enabled, you can switch between the modes
using the -d [cuda|openacc] option to the benchmarks.  Whether a process
allocates its communication buffers on the GPU device or on the host can be
controlled at run-time.  Use the -h option for more help.

    ./osu_latency -h
    Usage: osu_latency [options] [RANK0 RANK1]

    RANK0 and RANK1 may be `D' or `H' which specifies whether
    the buffer is allocated on the accelerator device or host
    memory for each mpi rank

    options:
      -d TYPE   accelerator device buffers can be of TYPE `cuda' or `openacc'
      -h        print this help message

Each of the pt2pt benchmarks takes two input parameters. The first parameter
indicates the location of the buffers at rank 0 and the second parameter
indicates the location of the buffers at rank 1. The value of each of these
parameters can be either 'H' or 'D' to indicate if the buffers are to be on the
host or on the device respectively. When no parameters are specified, the
buffers are allocated on the host.  The collective benchmarks will use buffers
allocated on the device if the -d option is used; otherwise the buffers will be
allocated on the host.
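
The handling of the two positional parameters can be sketched in C as below.
This is a minimal illustration, not the benchmark's parser: the enum and the
function parse_buffer_location are hypothetical names. A missing argument
(NULL) falls back to the host, matching the default described above.

```c
#include <assert.h>
#include <stddef.h>

enum buffer_location { BUFFER_HOST, BUFFER_DEVICE };

/*
 * Parses one RANK0/RANK1 argument: "H" selects host memory, "D" selects
 * device memory, and NULL (argument not given) defaults to host.
 * Returns 0 on success and -1 for any other string.
 */
int parse_buffer_location(const char *arg, enum buffer_location *out)
{
    assert(out != NULL);

    if (arg == NULL) {
        *out = BUFFER_HOST;
        return 0;
    }
    if (arg[0] != '\0' && arg[1] == '\0') {
        if (arg[0] == 'H') {
            *out = BUFFER_HOST;
            return 0;
        }
        if (arg[0] == 'D') {
            *out = BUFFER_DEVICE;
            return 0;
        }
    }
    return -1;
}
```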

Examples:

    - mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_latency D D

In this run, the latency test allocates buffers at both rank 0 and rank 1 on
the GPU devices.

    - mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_bw D H

In this run, the bandwidth test allocates buffers at rank 0 on the GPU device
and buffers at rank 1 on the host.

Setting GPU affinity
--------------------
GPU affinity for processes is set before MPI_Init is called in the benchmarks.
The process rank on a node is normally used to do this and different MPI
launchers expose this information through different environment variables. The
benchmarks use an environment variable called LOCAL_RANK to get this
information.

A script like the one below can be used to export this environment variable
when using mpirun_rsh.  This can be adapted to work with other MPI launchers
and libraries.

    #!/bin/bash

    export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
    exec "$@"

A copy of this script is installed as get_local_rank alongside the benchmarks.
It can be used as follows:

    mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank \
        ./osu_latency D D
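
On the benchmark side, turning the LOCAL_RANK value into a GPU choice can be
sketched as below. This is an assumption-laden illustration: the function
name device_for_local_rank is hypothetical, and the round-robin mapping
(local rank modulo the number of devices) and the fallback to device 0 are
assumptions, not behavior confirmed by this README.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Maps the string value of LOCAL_RANK to a device index.
 * Assumption: ranks are spread round-robin over the devices of a node.
 * Falls back to device 0 when the value is missing or not a valid
 * non-negative integer.
 */
int device_for_local_rank(const char *local_rank, int device_count)
{
    assert(device_count > 0);

    if (local_rank == NULL || *local_rank == '\0')
        return 0;

    char *end = NULL;
    errno = 0;
    long rank = strtol(local_rank, &end, 10);
    if (errno != 0 || *end != '\0' || rank < 0)
        return 0;

    return (int)(rank % device_count);
}
```

A caller would pass getenv("LOCAL_RANK") and the device count reported by
the accelerator runtime, then select the returned device before MPI_Init.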