OMB (OSU Micro-Benchmarks)
--------------------------
The OSU Micro-Benchmarks use the GNU build system. Therefore you can simply
use the following steps to build the MPI benchmarks.

Example:
    ./configure CC=/path/to/mpicc CXX=/path/to/mpicxx
    make
    make install

CC and CXX can also be set to other wrapper scripts to build the OpenSHMEM or
UPC++ benchmarks. Based on this setting, configure will detect whether your
library supports MPI-1, MPI-2, MPI-3, OpenSHMEM, and UPC++ to compile the
corresponding benchmarks. See http://mvapich.cse.ohio-state.edu/benchmarks/ to
download the latest version of this package.

OMB also contains ROCm, CUDA and OpenACC extensions to the benchmarks. CUDA
extensions can be enabled by configuring OMB with the --enable-cuda option as
shown below.

    ./configure CC=/path/to/mpicc \
                CXX=/path/to/mpicxx \
                --enable-cuda \
                --with-cuda-include=/path/to/cuda/include \
                --with-cuda-libpath=/path/to/cuda/lib
    make
    make install

ROCm extensions can be enabled by configuring OMB with the --enable-rocm
option as shown below. Similarly, OpenACC extensions can be enabled using the
--enable-openacc option. The MPI library used should be able to support MPI
communication from buffers in GPU device memory.

    ./configure CC=/path/to/mpicc \
                CXX=/path/to/mpicxx \
                --enable-rocm \
                --with-rocm=/path/to/rocm/install
    make
    make install

More information about the ROCm and CUDA extensions is given towards the end
of the README.

This package also distributes UPC put, get, and collective benchmarks. These
are located in the upc subdirectory and can be compiled by the following:

    for bench in osu_upc_memput \
                 osu_upc_memget \
                 osu_upc_all_scatter \
                 osu_upc_all_reduce \
                 osu_upc_all_gather \
                 osu_upc_all_gather_all \
                 osu_upc_all_exchange \
                 osu_upc_all_broadcast \
                 osu_upc_all_barrier
    do
        echo "Compiling $bench..."
        upcc $bench.c ../util/osu_util_pgas.c ../util/osu_util.c -o $bench
    done

The MPI Multiple Bandwidth / Message Rate (osu_mbw_mr), OpenSHMEM Put Message
Rate (osu_oshm_put_mr), and OpenSHMEM Atomics (osu_oshm_atomics) tests are
intended to be used with block assigned ranks. This means that all processes
on the same machine are assigned ranks sequentially.

    Rank    Block   Cyclic
    ----------------------
    0       host1   host1
    1       host1   host2
    2       host1   host1
    3       host1   host2
    4       host2   host1
    5       host2   host2
    6       host2   host1
    7       host2   host2

If you're using mpirun_rsh the ranks are assigned in the order they are seen
in the hostfile or on the command line. Please see your process manager's
documentation for information on how to control the distribution of the rank
to host mapping.

Point-to-Point MPI Benchmarks
-----------------------------
osu_latency - Latency Test
    * The latency tests are carried out in a ping-pong fashion. The sender
    * sends a message with a certain data size to the receiver and waits for a
    * reply from the receiver. The receiver receives the message from the
    * sender and sends back a reply with the same data size. Many iterations
    * of this ping-pong test are carried out and average one-way latency
    * numbers are obtained. The blocking versions of the MPI functions
    * (MPI_Send and MPI_Recv) are used in the tests.

osu_latency_mt - Multi-threaded Latency Test
    * The multi-threaded latency test performs a ping-pong test with a single
    * sender process and multiple threads on the receiving process. In this
    * test the sending process sends a message of a given data size to the
    * receiver and waits for a reply from the receiver process. The receiving
    * process has a variable number of receiving threads (set by default to
    * 2), where each thread calls MPI_Recv and upon receiving a message sends
    * back a response of equal size. Many iterations are performed and the
    * average one-way latency numbers are reported.
    * "-t" option can be used to set the number of sender and receiver
      threads to be used in a benchmark. Examples:
        -t 4        // receiver threads = 4 and sender threads = 1
        -t 4:6      // sender threads = 4 and receiver threads = 6
        -t 2:       // not defined

osu_latency_mp - Multi-process Latency Test
    * The multi-process latency test performs a ping-pong test with a single
    * sender process and a single receiver process, both having one or more
    * child processes that are spawned using the fork() system call. In this
    * test the sending (parent) process sends a message of a given data size
    * to the receiver (parent) process and waits for a reply from the
    * receiver process. Both the sending and receiving processes have a
    * variable number of child processes (set by default to 1 child process),
    * where each child process sleeps for 2 seconds after the fork call and
    * exits. The parent processes carry out the ping-pong test where many
    * iterations are performed and the average one-way latency numbers are
    * reported.
    * "-t" option can be used to set the number of sender and receiver
    * processes, including the parent processes, to be used in a benchmark.
    *
    * The purpose of this test is to check if the underlying MPI communication
    * runtime has taken care of fork safety even if the application has not.
    *
    * A new environment variable "MV2_SUPPORT_FORK_SAFETY" was introduced with
    * MVAPICH2 2.3.4 to make MVAPICH2 take care of fork safety for
    * applications that require it.
    *
    * The support for fork safety is disabled by default in MVAPICH2 for
    * performance reasons. When running osu_latency_mp with MVAPICH2, set
    * the environment variable MV2_SUPPORT_FORK_SAFETY to 1. When running
    * osu_latency_mp with other MPI libraries that do not support fork safety,
    * set the environment variable RDMAV_FORK_SAFE or IBV_FORK_SAFE to 1.
    * Examples:
        -t 4        // receiver processes = 4 and sender processes = 1
        -t 4:6      // sender processes = 4 and receiver processes = 6
        -t 2:       // not defined

osu_bw - Bandwidth Test
    * The bandwidth tests are carried out by having the sender send out a
    * fixed number (equal to the window size) of back-to-back messages to the
    * receiver and then wait for a reply from the receiver. The receiver
    * sends the reply only after receiving all these messages. This process is
    * repeated for several iterations and the bandwidth is calculated based on
    * the elapsed time (from the time the sender sends the first message until
    * the time it receives the reply back from the receiver) and the number of
    * bytes sent by the sender. The objective of this bandwidth test is to
    * determine the maximum sustained data rate that can be achieved at the
    * network level. Thus, the non-blocking versions of the MPI functions
    * (MPI_Isend and MPI_Irecv) are used in the test.

osu_bibw - Bidirectional Bandwidth Test
    * The bidirectional bandwidth test is similar to the bandwidth test, except
    * that both the nodes involved send out a fixed number of back-to-back
    * messages and wait for the reply. This test measures the maximum
    * sustainable aggregate bandwidth by two nodes.

osu_mbw_mr - Multiple Bandwidth / Message Rate Test
    * The multi-pair bandwidth and message rate test evaluates the aggregate
    * uni-directional bandwidth and message rate between multiple pairs of
    * processes. Each of the sending processes sends a fixed number of messages
    * (the window size) back-to-back to the paired receiving process before
    * waiting for a reply from the receiver. This process is repeated for
    * several iterations. The objective of this benchmark is to determine the
    * achieved bandwidth and message rate from one node to another node with a
    * configurable number of processes running on each node.
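
As an illustration of the arithmetic described above for osu_bw and
osu_mbw_mr, the following Python sketch derives a bandwidth and a message rate
from the window size, the iteration count and the elapsed time. It is not taken
from the OMB sources: the function names, the parameter names and the choice of
1 MB = 10^6 bytes are assumptions made for this example.

```python
def bandwidth_mb_per_s(msg_bytes, window_size, iterations, elapsed_s):
    """Bandwidth seen by one sender in MB/s (assumes 1 MB = 1e6 bytes)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # Each iteration sends `window_size` back-to-back messages of `msg_bytes`.
    total_bytes = msg_bytes * window_size * iterations
    return total_bytes / 1e6 / elapsed_s


def message_rate_per_s(window_size, iterations, pairs, elapsed_s):
    """Aggregate messages per second across `pairs` sender/receiver pairs."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return window_size * iterations * pairs / elapsed_s


if __name__ == "__main__":
    # Hypothetical run: 1 MB messages, window of 64, 100 iterations, 2 seconds.
    print(bandwidth_mb_per_s(1_000_000, 64, 100, 2.0))
    # Same window and iteration count, summed over 4 hypothetical pairs.
    print(message_rate_per_s(64, 100, 4, 2.0))
```

The elapsed time is meant to cover the whole measured interval, including the
wait for the receiver's reply, as described for osu_bw.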

osu_multi_lat - Multi-pair Latency Test
    * This test is very similar to the latency test. However, at the same
    * instant multiple pairs are performing the same test simultaneously.
    * In order to perform the test across just two nodes the hostnames must
    * be specified in block fashion.

Collective MPI Benchmarks
-------------------------
osu_allgather       - MPI_Allgather Latency Test(*)
osu_allgatherv      - MPI_Allgatherv Latency Test
osu_allreduce       - MPI_Allreduce Latency Test
osu_alltoall        - MPI_Alltoall Latency Test
osu_alltoallv       - MPI_Alltoallv Latency Test
osu_barrier         - MPI_Barrier Latency Test
osu_bcast           - MPI_Bcast Latency Test
osu_gather          - MPI_Gather Latency Test(*)
osu_gatherv         - MPI_Gatherv Latency Test
osu_reduce          - MPI_Reduce Latency Test
osu_reduce_scatter  - MPI_Reduce_scatter Latency Test
osu_scatter         - MPI_Scatter Latency Test(*)
osu_scatterv        - MPI_Scatterv Latency Test

Collective Latency Tests
    * The latest OMB version includes benchmarks for various MPI blocking
    * collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce,
    * MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter,
    * MPI_Scatter and vector collectives). These benchmarks work in the
    * following manner. When users run the osu_bcast benchmark with N
    * processes, the benchmark measures the min, max and the average latency
    * of the MPI_Bcast collective operation across N processes, for various
    * message lengths, over a large number of iterations. In the default
    * version, these benchmarks report the average latency for each message
    * length. Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark,
           such as min and max latencies and the number of iterations.
    * "-m" option can be used to set the minimum and maximum message length
           to be used in a benchmark. In the default version, the benchmarks
           report the latencies for up to 1MB message lengths.
           Examples:
               -m 128      // min = default, max = 128
               -m 2:128    // min = 2, max = 128
               -m 2:       // min = 2, max = default
    * "-x" can be used to set the number of warmup iterations to skip for
           each message length.
    * "-i" can be used to set the number of iterations to run for each
           message length.
    * "-M" can be used to set per process maximum memory consumption. By
           default the benchmarks are limited to 512MB allocations.

Support for CUDA Managed Memory
-------------------------------
The following benchmarks have been extended to evaluate performance of MPI
communication from and to buffers allocated using CUDA Managed Memory.

    * osu_bibw           - Bidirectional Bandwidth Test
    * osu_bw             - Bandwidth Test
    * osu_latency        - Latency Test
    * osu_mbw_mr         - Multiple Bandwidth / Message Rate Test
    * osu_multi_lat      - Multi-pair Latency Test
    * osu_allgather      - MPI_Allgather Latency Test
    * osu_allgatherv     - MPI_Allgatherv Latency Test
    * osu_allreduce      - MPI_Allreduce Latency Test
    * osu_alltoall       - MPI_Alltoall Latency Test
    * osu_alltoallv      - MPI_Alltoallv Latency Test
    * osu_bcast          - MPI_Bcast Latency Test
    * osu_gather         - MPI_Gather Latency Test
    * osu_gatherv        - MPI_Gatherv Latency Test
    * osu_reduce         - MPI_Reduce Latency Test
    * osu_reduce_scatter - MPI_Reduce_scatter Latency Test
    * osu_scatter        - MPI_Scatter Latency Test
    * osu_scatterv       - MPI_Scatterv Latency Test

In addition to support for communications to and from GPU memories allocated
using CUDA or OpenACC, we now provide the additional capability of performing
communications to and from buffers allocated using the CUDA Managed Memory
concept. CUDA Managed (or Unified) Memory allows applications to allocate
memory on either CPU or GPU memories using the cudaMallocManaged() call. This
allows the memory buffer to be transferred between the CPU and the GPU
transparently to the user. Currently, we offer benchmarking with CUDA Managed
Memory using the tests mentioned above.
These benchmarks have additional options:
    * "M" allocates a send or receive buffer as managed for point to point
      communication.
    * "-d managed" uses managed memory buffers to perform collective
      communications.

Non-Blocking Collective MPI Benchmarks
--------------------------------------
osu_iallgather      - MPI_Iallgather Latency Test
osu_iallgatherv     - MPI_Iallgatherv Latency Test
osu_iallreduce      - MPI_Iallreduce Latency Test
osu_ialltoall       - MPI_Ialltoall Latency Test
osu_ialltoallv      - MPI_Ialltoallv Latency Test
osu_ialltoallw      - MPI_Ialltoallw Latency Test
osu_ibarrier        - MPI_Ibarrier Latency Test
osu_ibcast          - MPI_Ibcast Latency Test
osu_igather         - MPI_Igather Latency Test
osu_igatherv        - MPI_Igatherv Latency Test
osu_ireduce         - MPI_Ireduce Latency Test
osu_iscatter        - MPI_Iscatter Latency Test
osu_iscatterv       - MPI_Iscatterv Latency Test

Non-Blocking Collective Latency Tests
    * In addition to the blocking collective latency tests, we provide several
    * non-blocking collectives as mentioned above. These evaluate the same
    * metrics as the blocking operations as well as the additional metric
    * "overlap". This is defined as the amount of computation that can be
    * performed while the communication progresses in the background.
    * These benchmarks have the additional option:
    * "-t" sets the number of MPI_Test() calls (CALLS) made during the dummy
           computation; set CALLS to 100, 1000, or any number > 0.

One-sided MPI Benchmarks
------------------------
osu_put_latency - Latency Test for Put with Active/Passive Synchronization
    * The put latency benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence).
    * For active synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the origin process calls MPI_Put to
    * directly place data of a certain size in the remote process's window
    * and then waits on a synchronization call (MPI_Win_complete) for
    * completion. The remote process participates in synchronization with
    * MPI_Win_post and MPI_Win_wait calls. Several iterations of this test are
    * carried out and the average put latency is reported. The latency
    * includes the synchronization time also.
    * For passive synchronization, for example when users run with
    * MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
    * target process's window and calls MPI_Put to directly place data of a
    * certain size in the window. Then it calls MPI_Win_unlock to ensure
    * completion of the Put and release the lock on the window. This is
    * carried out for several iterations and the average time for the
    * MPI_Win_lock + MPI_Put + MPI_Win_unlock calls is measured. The default
    * window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.
    * "-x"             can be used to set the number of warmup iterations to
    *                  skip for each message length.
    * "-i"             can be used to set the number of iterations to run for
    *                  each message length.
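
To make the roles of "-x" (warmup iterations to skip) and "-i" (iterations to
run) concrete, here is a small Python sketch of a timing loop that excludes
the warmup iterations from the measured average. It is only a model of the
idea, not the benchmark's code: average_latency_us and its parameters are
hypothetical names, and the operation argument stands in for one
synchronization + MPI_Put sequence.

```python
import time


def average_latency_us(operation, iterations=1000, skip=100):
    """Average time per call of `operation` in microseconds.

    The first `skip` calls are warmup and are not timed; the following
    `iterations` calls are timed together and averaged.
    """
    if iterations <= 0 or skip < 0:
        raise ValueError("iterations must be > 0 and skip must be >= 0")
    start = None
    for i in range(skip + iterations):
        if i == skip:
            # Start the clock only once the warmup calls are done.
            start = time.perf_counter()
        operation()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1e6 / iterations


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would issue communication calls here.
    print(average_latency_us(lambda: sum(range(100)), iterations=200, skip=20))
```

Skipping the first calls keeps one-time start-up costs out of the reported
average.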

osu_get_latency - Latency Test for Get with Active/Passive Synchronization
    * The get latency benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
    * synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the origin process calls MPI_Get to
    * directly fetch data of a certain size from the target process's window
    * into a local buffer. It then waits on a synchronization call
    * (MPI_Win_complete) for local completion of the Gets. The remote process
    * participates in synchronization with MPI_Win_post and MPI_Win_wait
    * calls. Several iterations of this test are carried out and the average
    * get latency is reported. The latency includes the synchronization time
    * also. For passive synchronization, for example when users run with
    * MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
    * target process's window and calls MPI_Get to directly read data of a
    * certain size from the window. Then it calls MPI_Win_unlock to ensure
    * completion of the Get and release the lock on the remote window. This is
    * carried out for several iterations and the average time for the
    * MPI_Win_lock + MPI_Get + MPI_Win_unlock calls is measured. The default
    * window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization
    * The put bandwidth benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
    * synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the test is carried out by the origin
    * process calling a fixed number of back-to-back MPI_Puts on the remote
    * window and then waiting on a synchronization call (MPI_Win_complete) for
    * their completion. The remote process participates in synchronization
    * with MPI_Win_post and MPI_Win_wait calls. This process is repeated for
    * several iterations and the bandwidth is calculated based on the elapsed
    * time and the number of bytes put by the origin process. For passive
    * synchronization, for example when users run with MPI_Win_lock/unlock,
    * the origin process calls MPI_Win_lock to lock the target process's
    * window and calls a fixed number of back-to-back MPI_Puts to directly
    * place data in the window. Then it calls MPI_Win_unlock to ensure
    * completion of the Puts and release the lock on the remote window. This
    * process is repeated for several iterations and the bandwidth is
    * calculated based on the elapsed time and the number of bytes put by the
    * origin process. The default window initialization and synchronization
    * operations are MPI_Win_allocate and MPI_Win_flush. The benchmark offers
    * the following options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization
    * The get bandwidth benchmark includes window initialization operations
    * (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic) and
    * synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
    * synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the test is carried out by the origin
    * process calling a fixed number of back-to-back MPI_Gets and then waiting
    * on a synchronization call (MPI_Win_complete) for their completion. The
    * remote process participates in synchronization with MPI_Win_post and
    * MPI_Win_wait calls. This process is repeated for several iterations and
    * the bandwidth is calculated based on the elapsed time and the number of
    * bytes received by the origin process. For passive synchronization, for
    * example when users run with MPI_Win_lock/unlock, the origin process
    * calls MPI_Win_lock to lock the target process's window and calls a
    * fixed number of back-to-back MPI_Gets to directly get data from the
    * window. Then it calls MPI_Win_unlock to ensure completion of the Gets
    * and release the lock on the window. This process is repeated for
    * several iterations and the bandwidth is calculated based on the elapsed
    * time and the number of bytes read by the origin process. The default
    * window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

osu_put_bibw - Bi-directional Bandwidth Test for Put with Active Synchronization
    * The put bi-directional bandwidth benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_post/start/complete/wait and
    * MPI_Win_fence). This test is similar to the bandwidth test, except that
    * both the processes involved send out a fixed number of back-to-back
    * MPI_Puts and wait for their completion. This test measures the maximum
    * sustainable aggregate bandwidth by two processes. The default window
    * initialization and synchronization operations are MPI_Win_allocate and
    * MPI_Win_post/start/complete/wait. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

osu_acc_latency - Latency Test for Accumulate with Active/Passive Synchronization
    * The accumulate latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence).
    * For active synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the origin process calls
    * MPI_Accumulate to combine data from the local buffer with the data in
    * the remote window and store it in the remote window. The combining
    * operation used in the test is MPI_SUM. The origin process then waits on
    * a synchronization call (MPI_Win_complete) for completion of the
    * operations. The remote process waits on an MPI_Win_wait call. Several
    * iterations of this test are carried out and the average accumulate
    * latency number is obtained. The latency includes the synchronization
    * time also. For passive synchronization, for example when users run with
    * MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
    * target process's window and calls MPI_Accumulate to combine data from a
    * local buffer with the data in the remote window and store it in the
    * remote window. Then it calls MPI_Win_unlock to ensure completion of the
    * Accumulate and release the lock on the window. This is carried out for
    * several iterations and the average time for the MPI_Win_lock +
    * MPI_Accumulate + MPI_Win_unlock calls is measured. The default window
    * initialization and synchronization operations are MPI_Win_allocate and
    * MPI_Win_flush. The benchmark offers the following options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.
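
The effect of an accumulate with a sum operation on the target window can be
pictured with plain Python lists. This is only a conceptual model under stated
assumptions, not MPI code: accumulate_sum, target_window, origin_buffer and
offset are hypothetical names, and no synchronization or communication is
modelled.

```python
def accumulate_sum(target_window, origin_buffer, offset=0):
    """Add each origin element into the target window, in place.

    Models "combine data from the local buffer with the data in the remote
    window and store it in the remote window" for a sum operation.
    """
    if offset < 0 or offset + len(origin_buffer) > len(target_window):
        raise IndexError("origin buffer does not fit in the target window")
    for i, value in enumerate(origin_buffer):
        target_window[offset + i] += value


if __name__ == "__main__":
    window = [1, 2, 3, 4]            # data already held by the target
    accumulate_sum(window, [10, 20], offset=1)
    print(window)                    # target now holds the element-wise sums
```

Unlike a Put, the previous contents of the window contribute to the result,
which is why the benchmark names the combining operation explicitly.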

osu_cas_latency - Latency Test for Compare and Swap with Active/Passive Synchronization
    * The Compare_and_swap latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
    * synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the origin process calls
    * MPI_Compare_and_swap to place one element from the origin buffer into
    * the target buffer. The initial value in the target buffer is returned to
    * the calling process. The origin process then waits on a synchronization
    * call (MPI_Win_complete) for local completion of the operations. The
    * remote process waits on an MPI_Win_wait call. Several iterations of this
    * test are carried out and the average Compare_and_swap latency number is
    * obtained. The latency includes the synchronization time also. For
    * passive synchronization, for example when users run with
    * MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
    * target process's window and calls MPI_Compare_and_swap to place one
    * element from the origin buffer into the target buffer. The initial value
    * in the target buffer is returned to the calling process. Then it calls
    * MPI_Win_flush to ensure completion of the Compare_and_swap. In the end,
    * it calls MPI_Win_unlock to release the lock on the window. This is
    * carried out for several iterations and the average time for the
    * MPI_Compare_and_swap + MPI_Win_flush calls is measured. The default
    * window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization
    * The Fetch_and_op latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
    * synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the origin process calls
    * MPI_Fetch_and_op to increase the element in the target buffer by 1. The
    * initial value from the target buffer is returned to the calling process.
    * The origin process waits on a synchronization call (MPI_Win_complete)
    * for completion of the operations. The remote process waits on an
    * MPI_Win_wait call. Several iterations of this test are carried out and
    * the average Fetch_and_op latency number is obtained. The latency
    * includes the synchronization time also. For passive synchronization,
    * for example when users run with MPI_Win_lock/unlock, the origin process
    * calls MPI_Win_lock to lock the target process's window and calls
    * MPI_Fetch_and_op to increase the element in the target buffer by 1. The
    * initial value in the target buffer is returned to the calling process.
    * Then it calls MPI_Win_flush to ensure completion of the Fetch_and_op. In
    * the end, it calls MPI_Win_unlock to release the lock on the window. This
    * is carried out for several iterations and the average time for the
    * MPI_Fetch_and_op + MPI_Win_flush calls is measured.
    * The default window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization
    * The Get_accumulate latency benchmark includes window initialization
    * operations (MPI_Win_create, MPI_Win_allocate and MPI_Win_create_dynamic)
    * and synchronization operations (MPI_Win_lock/unlock, MPI_Win_flush,
    * MPI_Win_flush_local, MPI_Win_lock_all/unlock_all,
    * MPI_Win_post/start/complete/wait and MPI_Win_fence). For active
    * synchronization, for example when users run with
    * MPI_Win_post/start/complete/wait, the origin process calls
    * MPI_Get_accumulate to combine data from the local buffer with the data
    * in the remote window and store it in the remote window. The combining
    * operation used in the test is MPI_SUM. The initial value from the target
    * buffer is returned to the calling process. The origin process waits on a
    * synchronization call (MPI_Win_complete) for local completion of the
    * operations. The remote process waits on an MPI_Win_wait call. Several
    * iterations of this test are carried out and the average get accumulate
    * latency number is obtained. The latency includes the synchronization
    * time also.
    * For passive synchronization, for example when users run with
    * MPI_Win_lock/unlock, the origin process calls MPI_Win_lock to lock the
    * target process's window and calls MPI_Get_accumulate to combine data
    * from a local buffer with the data in the remote window and store it in
    * the remote window. The initial value from the target buffer is returned
    * to the calling process. Then it calls MPI_Win_unlock to ensure
    * completion of the Get_accumulate and release the lock on the window.
    * This is carried out for several iterations and the average time for the
    * MPI_Win_lock + MPI_Get_accumulate + MPI_Win_unlock calls is measured.
    * The default window initialization and synchronization operations are
    * MPI_Win_allocate and MPI_Win_flush. The benchmark offers the following
    * options:
    * "-w create"      use MPI_Win_create to create an MPI Window object.
    * "-w allocate"    use MPI_Win_allocate to create an MPI Window object.
    * "-w dynamic"     use MPI_Win_create_dynamic to create an MPI Window
    *                  object.
    * "-s lock"        use MPI_Win_lock/unlock synchronization calls.
    * "-s flush"       use MPI_Win_flush synchronization call.
    * "-s flush_local" use MPI_Win_flush_local synchronization call.
    * "-s lock_all"    use MPI_Win_lock_all/unlock_all synchronization calls.
    * "-s pscw"        use Post/Start/Complete/Wait synchronization calls.
    * "-s fence"       use MPI_Win_fence synchronization call.

Point-to-Point OpenSHMEM Benchmarks
-----------------------------------
osu_oshm_put.c - Latency Test for OpenSHMEM Put Routine
    * This benchmark measures the latency of a shmem_putmem operation for
    * different data sizes. The user is required to select whether the
    * communication buffers should be allocated in global memory or heap
    * memory, through a parameter. The test requires exactly two PEs. PE 0
    * issues shmem_putmem to write data at PE 1 and then calls shmem_quiet.
    * This is repeated for a fixed number of iterations, depending on the data
    * size. The average latency per iteration is reported.
    * A few warm-up iterations are run without timing to exclude any start-up
    * overheads. Both PEs call shmem_barrier_all after the test for each
    * message size.

osu_oshm_put_nb.c - Latency Test for OpenSHMEM Non-blocking Put Routine
    * This benchmark measures the non-blocking latency of a shmem_putmem_nbi
    * operation for different data sizes. The user is required to select,
    * through a parameter, whether the communication buffers should be
    * allocated in global memory or heap memory. The test requires exactly
    * two PEs. PE 0 issues shmem_putmem_nbi to write data at PE 1 and then
    * calls shmem_quiet. This is repeated for a fixed number of iterations,
    * depending on the data size. The average latency per iteration is
    * reported. A few warm-up iterations are run without timing to exclude
    * any start-up overheads. Both PEs call shmem_barrier_all after the test
    * for each message size.

osu_oshm_get.c - Latency Test for OpenSHMEM Get Routine
    * This benchmark is similar to osu_oshm_put except that PE 0 issues a
    * shmem_getmem operation to read data from PE 1 in each iteration. The
    * average latency per iteration is reported.

osu_oshm_get_nb.c - Latency Test for OpenSHMEM Non-blocking Get Routine
    * This benchmark is similar to osu_oshm_put_nb except that PE 0 issues a
    * shmem_getmem_nbi operation to read data from PE 1 in each iteration. The
    * average latency per iteration is reported.

osu_oshm_put_mr.c - Message Rate Test for OpenSHMEM Put Routine
    * This benchmark measures the aggregate uni-directional operation rate of
    * OpenSHMEM Put between pairs of PEs, for different data sizes. The user
    * selects whether the communication buffers are in global memory or heap
    * memory, as with the earlier benchmarks. This test requires an even
    * number of PEs. The PEs are paired with PE 0 pairing with PE n/2 and so
    * on, where n is the total number of PEs. The first PE in each pair issues
    * back-to-back shmem_putmem operations to its peer PE.
    * The total time for the put operations is measured and the operation rate
    * per second is reported. All PEs call shmem_barrier_all after the test
    * for each message size.

osu_oshm_put_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Put Routine
    * This benchmark measures the aggregate uni-directional operation rate of
    * OpenSHMEM Non-blocking Put between pairs of PEs, for different data
    * sizes. The user selects whether the communication buffers are in global
    * memory or heap memory, as with the earlier benchmarks. This test
    * requires an even number of PEs. The PEs are paired with PE 0 pairing
    * with PE n/2 and so on, where n is the total number of PEs. The first PE
    * in each pair issues back-to-back shmem_putmem_nbi operations to its peer
    * PE, up to the window size. A call to shmem_quiet is placed after the
    * window loop to ensure completion of the issued operations. The total
    * time for the non-blocking put operations is measured and the operation
    * rate per second is reported. All PEs call shmem_barrier_all after the
    * test for each message size.

osu_oshm_get_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Get Routine
    * This benchmark measures the aggregate uni-directional operation rate of
    * OpenSHMEM Non-blocking Get between pairs of PEs, for different data
    * sizes. The user selects whether the communication buffers are in global
    * memory or heap memory, as with the earlier benchmarks. This test
    * requires an even number of PEs. The PEs are paired with PE 0 pairing
    * with PE n/2 and so on, where n is the total number of PEs. The first PE
    * in each pair issues back-to-back shmem_getmem_nbi operations to its peer
    * PE, up to the window size. A call to shmem_quiet is placed after the
    * window loop to ensure completion of the issued operations. The total
    * time for the non-blocking get operations is measured and the operation
    * rate per second is reported. All PEs call shmem_barrier_all after the
    * test for each message size.
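
The pairing described for the message rate tests above (PE 0 with PE n/2,
and so on) can be sketched in shell arithmetic. This is only an illustration
of the rule stated in the text; peer_of is a hypothetical helper name and is
not part of OMB.

```shell
#!/bin/bash
# Hypothetical helper (not part of OMB): print the peer of PE $1 when
# $2 PEs are paired as (i, i + n/2), as described for the message rate tests.
peer_of() {
    local pe=$1 n=$2 half
    if [ $(( n % 2 )) -ne 0 ]; then
        echo "number of PEs must be even" >&2
        return 1
    fi
    half=$(( n / 2 ))
    if [ "$pe" -lt "$half" ]; then
        echo $(( pe + half ))   # first PE of the pair: issues the operations
    else
        echo $(( pe - half ))   # second PE of the pair: the target
    fi
}

peer_of 0 8   # prints 4
peer_of 5 8   # prints 1
```

With 8 PEs the pairs are (0,4), (1,5), (2,6) and (3,7), which is why these
tests need an even number of PEs and block assigned ranks.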

osu_oshm_put_overlap.c - Non-blocking Message Rate Overlap Test
    * This benchmark measures the aggregate uni-directional operation rate
    * overlap for OpenSHMEM Put between pairs of PEs, for different data
    * sizes. The user selects whether the communication buffers are in global
    * memory or heap memory, as with the earlier benchmarks. Because the PEs
    * are paired, this test requires an even number of PEs. The benchmark
    * prints statistics for the different phases of communication,
    * computation and overlap at the end.

osu_oshm_atomics.c - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
    * This benchmark measures the performance of atomic fetch-and-operate and
    * atomic operate routines supported in OpenSHMEM for the integer
    * and long datatypes. The buffers can be selected to be in heap memory or
    * global memory. The PEs are paired as in the Put Operation Rate
    * benchmark and the first PE in each pair issues back-to-back atomic
    * operations of a type to its peer PE. The average latency per atomic
    * operation and the aggregate operation rate are reported. This is
    * repeated for each of the fadd, finc, add, inc, cswap, swap, set, and
    * fetch routines.

Collective OpenSHMEM Benchmarks
-------------------------------
osu_oshm_collect        - OpenSHMEM Collect Latency Test
osu_oshm_fcollect       - OpenSHMEM FCollect Latency Test
osu_oshm_broadcast      - OpenSHMEM Broadcast Latency Test
osu_oshm_reduce         - OpenSHMEM Reduce Latency Test
osu_oshm_barrier        - OpenSHMEM Barrier Latency Test

Collective Latency Tests
    * The latest OMB Version includes benchmarks for various OpenSHMEM
    * collective operations (shmem_collect, shmem_broadcast, shmem_reduce and
    * shmem_barrier). These benchmarks work in the following manner. For
    * example, when the osu_oshm_broadcast benchmark is run with N processes,
    * it measures the min, max and the average latency of the
    * shmem_broadcast collective operation across N processes, for various
    * message lengths, over a large number of iterations.
    * In the default version, these benchmarks report the average latency for
    * each message length. Additionally, the benchmarks offer the following
    * options:
    * "-f" can be used to report additional statistics of the benchmark,
           such as min and max latencies and the number of iterations.
    * "-m" option can be used to set the maximum message length to be used in a
           benchmark. In the default version, the benchmarks report the
           latencies for up to 1MB message lengths.
    * "-i" can be used to set the number of iterations to run for each message
           length.

Point-to-Point UPC Benchmarks
-----------------------------
osu_upc_memput.c - Put Latency
    * This benchmark measures the latency of the UPC put operation between
    * multiple UPC threads. In this benchmark, UPC threads with ranks less
    * than (THREADS/2) issue upc_memput operations to peer UPC threads. Peer
    * threads are identified as (MYTHREAD+THREADS/2). This is repeated for a
    * fixed number of iterations, for varying data sizes. The average latency
    * per iteration is reported. A few warm-up iterations are run without
    * timing to exclude any start-up overheads. All UPC threads call
    * upc_barrier after the test for each message size.

osu_upc_memget.c - Get Latency
    * This benchmark is similar to the osu_upc_memput benchmark described
    * above. The difference is that the shared string handling function is
    * upc_memget. The average get operation latency per iteration is reported.

Collective UPC Benchmarks
-------------------------
osu_upc_all_barrier     - UPC Barrier Latency Test
osu_upc_all_broadcast   - UPC Broadcast Latency Test
osu_upc_all_scatter     - UPC Scatter Latency Test
osu_upc_all_gather      - UPC Gather Latency Test
osu_upc_all_gather_all  - UPC GatherAll Latency Test
osu_upc_all_reduce      - UPC Reduce Latency Test
osu_upc_all_exchange    - UPC Exchange Latency Test

Collective Latency Tests
    * The latest OMB Version includes benchmarks for various UPC collective
    * operations (upc_all_barrier, upc_all_broadcast, upc_all_scatter,
    * upc_all_gather, upc_all_gather_all, upc_all_reduce, and
    * upc_all_exchange). These benchmarks work in the following manner. For
    * example, when the osu_upc_all_broadcast benchmark is run with N
    * processes, it measures the min, max and the average latency of the
    * upc_all_broadcast collective operation across N processes, for various
    * message lengths, over a large number of iterations. In the default
    * version, these benchmarks report the average latency for each message
    * length. Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark, such
           as min and max latencies and the number of iterations.
    * "-m" option can be used to set the maximum message length to be used in a
           benchmark. In the default version, the benchmarks report the
           latencies for up to 1MB message lengths.
    * "-i" can be used to set the number of iterations to run for each message
           length.

Point-to-Point UPC++ Benchmarks
-------------------------------
osu_upcxx_async_copy_put.c - Put Latency
    * This benchmark measures the latency of the UPC++ async_copy operation
    * between multiple UPC++ threads. In this benchmark, UPC++ threads with
    * ranks less than (THREADS/2) issue UPC++ async_copy from local to remote
    * memory on peer threads. Peer threads are identified as
    * (MYTHREAD+THREADS/2). This is repeated for a fixed number of iterations,
    * for varying data sizes.
    * The average latency per iteration is reported. A few warm-up iterations
    * are run without timing to exclude any start-up overheads. All UPC++
    * threads call barrier after the test for each message size.

osu_upcxx_async_copy_get.c - Get Latency
    * This benchmark is similar to the osu_upcxx_async_copy_put benchmark
    * described above. The difference is that the async_copy operation
    * copies from remote to local memory. The average get operation latency per
    * iteration is reported.

Collective UPC++ Benchmarks
---------------------------
osu_upcxx_allgather     - UPC++ Allgather Latency Test
osu_upcxx_alltoall      - UPC++ Alltoall Latency Test
osu_upcxx_bcast         - UPC++ Broadcast Latency Test
osu_upcxx_gather        - UPC++ Gather Latency Test
osu_upcxx_reduce        - UPC++ Reduce Latency Test
osu_upcxx_scatter       - UPC++ Scatter Latency Test

Collective Latency Tests
    * The latest OMB Version includes benchmarks for various UPC++ collective
    * operations (upcxx_allgather, upcxx_alltoall, upcxx_bcast, upcxx_gather,
    * upcxx_reduce, and upcxx_scatter). These benchmarks work in the following
    * manner. For example, when the osu_upcxx_bcast benchmark is run with N
    * processes, it measures the min, max and the average latency of the
    * upcxx_bcast collective operation across N processes, for various message
    * lengths, over a large number of iterations. In the default version, these
    * benchmarks report the average latency for each message length.
    * Additionally, the benchmarks offer the following options:
    * "-f" can be used to report additional statistics of the benchmark, such
           as min and max latencies and the number of iterations.
    * "-m" option can be used to set the maximum message length to be used in a
           benchmark. In the default version, the benchmarks report the
           latencies for up to 1MB message lengths.
    * "-i" can be used to set the number of iterations to run for each message
           length.
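
To make the statistics reported by the collective latency tests concrete,
the sketch below computes the average, min and max of a set of per-process
latencies. It only illustrates what the three numbers mean. It assumes one
latency value per line on standard input, the helper name latency_stats is
hypothetical, and it is not how the benchmarks gather these values
internally.

```shell
#!/bin/bash
# Hypothetical helper (not part of OMB): read one per-process latency per
# line and print "avg min max", each with two decimal places.
latency_stats() {
    awk '
        {
            v = $1 + 0                       # force numeric comparison
            if (NR == 1) { min = v; max = v }
            if (v < min) min = v
            if (v > max) max = v
            sum += v
        }
        END { if (NR > 0) printf "%.2f %.2f %.2f\n", sum / NR, min, max }
    '
}

# Example with three made-up per-process latencies (in microseconds).
printf '%s\n' 2.0 4.0 3.0 | latency_stats   # prints: 3.00 2.00 4.00
```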

Startup Benchmarks
------------------
osu_init.c - This benchmark measures the minimum, maximum, and average time
    * each process takes to complete MPI_Init.

osu_hello.c - This is a simple hello world program. Users can take advantage of
    * this to measure the time it takes for all processes to execute
    * MPI_Init + MPI_Finalize.
    *
    * Example:
    * - time mpirun_rsh -np 2 -hostfile hostfile osu_hello

ROCm, CUDA and OpenACC Extensions to OMB
----------------------------------------
CUDA Extensions to OMB can be enabled by configuring the benchmark suite with
the --enable-cuda option as shown below.

    ./configure CC=/path/to/mpicc \
                CXX=/path/to/mpicxx \
                --enable-cuda \
                --with-cuda-include=/path/to/cuda/include \
                --with-cuda-libpath=/path/to/cuda/lib
    make
    make install

ROCm extensions can be enabled by configuring OMB with the --enable-rocm
option as shown below. Similarly, OpenACC extensions can be enabled using the
--enable-openacc option. The MPI library used should be able to support MPI
communication from buffers in GPU device memory.

    ./configure CC=/path/to/mpicc \
                CXX=/path/to/mpicxx \
                --enable-rocm \
                --with-rocm=/path/to/rocm/install
    make
    make install

The following benchmarks have been extended to evaluate performance of MPI
communication using buffers on AMD and NVIDIA GPU devices.

    osu_bibw            - Bidirectional Bandwidth Test
    osu_bw              - Bandwidth Test
    osu_latency         - Latency Test
    osu_mbw_mr          - Multiple Bandwidth / Message Rate Test
    osu_multi_lat       - Multi-pair Latency Test
    osu_latency_mt      - Multi-threaded Latency Test
    osu_latency_mp      - Multi-process Latency Test
    osu_put_latency     - Latency Test for Put
    osu_get_latency     - Latency Test for Get
    osu_put_bw          - Bandwidth Test for Put
    osu_get_bw          - Bandwidth Test for Get
    osu_put_bibw        - Bidirectional Bandwidth Test for Put
    osu_acc_latency     - Latency Test for Accumulate
    osu_cas_latency     - Latency Test for Compare and Swap
    osu_fop_latency     - Latency Test for Fetch and Op
    osu_allgather       - MPI_Allgather Latency Test
    osu_allgatherv      - MPI_Allgatherv Latency Test
    osu_allreduce       - MPI_Allreduce Latency Test
    osu_alltoall        - MPI_Alltoall Latency Test
    osu_alltoallv       - MPI_Alltoallv Latency Test
    osu_bcast           - MPI_Bcast Latency Test
    osu_gather          - MPI_Gather Latency Test
    osu_gatherv         - MPI_Gatherv Latency Test
    osu_reduce          - MPI_Reduce Latency Test
    osu_reduce_scatter  - MPI_Reduce_scatter Latency Test
    osu_scatter         - MPI_Scatter Latency Test
    osu_scatterv        - MPI_Scatterv Latency Test
    osu_iallgather      - MPI_Iallgather Latency Test
    osu_iallgatherv     - MPI_Iallgatherv Latency Test
    osu_iallreduce      - MPI_Iallreduce Latency Test
    osu_ialltoall       - MPI_Ialltoall Latency Test
    osu_ialltoallv      - MPI_Ialltoallv Latency Test
    osu_ialltoallw      - MPI_Ialltoallw Latency Test
    osu_ibcast          - MPI_Ibcast Latency Test
    osu_igather         - MPI_Igather Latency Test
    osu_igatherv        - MPI_Igatherv Latency Test
    osu_ireduce         - MPI_Ireduce Latency Test
    osu_iscatter        - MPI_Iscatter Latency Test
    osu_iscatterv       - MPI_Iscatterv Latency Test

If both CUDA and OpenACC support are enabled, you can switch between the modes
using the -d [cuda|openacc] option to the benchmarks. If ROCm support is
enabled, you need to use the -d rocm option to make the benchmarks use this
feature. Whether a process allocates its communication buffers on the GPU
device or on the host can be controlled at run-time.
Use the -h option for more help.

    ./osu_latency -h
    Usage: osu_latency [options] [RANK0 RANK1]

    RANK0 and RANK1 may be `D' or `H' which specifies whether
    the buffer is allocated on the accelerator device or host
    memory for each mpi rank

    options:
      -d TYPE   accelerator device buffers can be of TYPE `cuda' or `openacc'
      -h        print this help message

Each of the pt2pt benchmarks takes two input parameters. The first parameter
indicates the location of the buffers at rank 0 and the second parameter
indicates the location of the buffers at rank 1. The value of each of these
parameters can be either 'H' or 'D' to indicate if the buffers are to be on the
host or on the device respectively. When no parameters are specified, the
buffers are allocated on the host. The collective benchmarks will use buffers
allocated on the device if the -d option is used; otherwise the buffers will be
allocated on the host.

Examples:

    - mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_latency D D
    - mpirun_rsh -np 2 -hostfile hostfile MV2_USE_ROCM=1 osu_latency D D

In this run, the latency test allocates buffers at both rank 0 and rank 1 on
the GPU devices.

    - mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_bw D H
    - mpirun_rsh -np 2 -hostfile hostfile MV2_USE_ROCM=1 osu_bw D H

In this run, the bandwidth test allocates buffers at rank 0 on the GPU device
and buffers at rank 1 on the host.

Setting GPU affinity
--------------------
GPU affinity for processes is set before MPI_Init is called in the benchmarks.
The process rank on a node is normally used to do this and different MPI
launchers expose this information through different environment variables. The
benchmarks use an environment variable called LOCAL_RANK to get this
information. Starting with OMB v5.4.4, the benchmarks automatically identify
the process rank on a node for MVAPICH2 when launched with mpirun_rsh.
However, a script like the one below can be used to export this environment
variable when using OMB with other MPI launchers and libraries.

    #!/bin/bash

    export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
    exec $*

A copy of this script is installed as get_local_rank alongside the benchmarks.
It can be used as follows:

    mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank \
        ./osu_latency D D
    mpirun_rsh -np 2 -hostfile hostfile MV2_USE_ROCM=1 get_local_rank \
        ./osu_latency D D
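
For launchers other than mpirun_rsh, the same idea can be extended to try
several environment variables in turn. The following is a sketch only and is
not shipped with OMB: the function name resolve_local_rank is hypothetical,
and the variable list (MV2_COMM_WORLD_LOCAL_RANK for MVAPICH2,
OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, SLURM_LOCALID for Slurm) is an
assumption about which launchers are in use. Check your launcher's
documentation for the variable it exports.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of OMB): print the node-local rank from
# the first launcher variable that is set, or fail if none is set.
resolve_local_rank() {
    local var val
    for var in MV2_COMM_WORLD_LOCAL_RANK OMPI_COMM_WORLD_LOCAL_RANK SLURM_LOCALID; do
        eval "val=\${$var:-}"      # indirect lookup of the variable named in $var
        if [ -n "$val" ]; then
            echo "$val"
            return 0
        fi
    done
    return 1
}

# When invoked with a command, export LOCAL_RANK (if one was found) and
# replace this shell with that command, preserving argument quoting.
if [ "$#" -gt 0 ]; then
    LOCAL_RANK=$(resolve_local_rank) && export LOCAL_RANK
    exec "$@"
fi
```

Saved under a hypothetical name such as local_rank_wrapper, it would be
placed on the command line in the same position as get_local_rank above,
directly before ./osu_latency D D.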