ScorIA: Sparse Memory Acceleration Testbed

Description

Prototype testbed for various memory acceleration schemes focused on improving sparse memory accesses and scatter/gather performance through indirection arrays.

Client/Controller design to service sparse memory requests (0, 1, and 2 levels of indirection)

Installation

Pre-Requisites

  • CMake (Version >= 3.12)
  • C and C++ compilers (the C++ compiler must support C++20 for the UME submodule, e.g. GCC 12.2.0 or newer)
  • OpenMP (for calibration test)
  • Python3 (for scripts)
  • MPI (for use with Ume client)
  • Caliper (for profiling Ume client)

Submodules

Dependencies

Included directly. Source code and build systems have been modified where needed to work with Scoria.

Simple Build

Currently builds on Intel architectures. An Arm version is in progress.

git clone git@github.com:lanl/scoria.git
cd scoria
mkdir build
cd build
cmake ..
make

Full Build with Ume Serial and Ume MPI

git clone git@github.com:lanl/scoria.git
cd scoria
mkdir build
cd build
cmake -DUSE_MPI=ON ..
make

CMake Keywords

Ume

| Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
|---|---|---|---|---|
| USE_MPI | Build with MPI | OFF | Complete | |
| USE_CALIPER | Build with Caliper | OFF | Complete | |

Intrinsics

| Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
|---|---|---|---|---|
| Scoria_REQUIRE_AVX | Build with AVX 512 Support (-mavx512f) | OFF | Complete | USE_AVX |
| Scoria_REQUIRE_SVE | Build with SVE Support (-march=armv8.2-a) | OFF | In Progress | USE_SVE |

Core Options

| Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
|---|---|---|---|---|
| MAX_CLIENTS | Maximum number of clients that can simultaneously connect to the controller | 1 | Complete | MAX_CLIENTS |
| REQUEST_QUEUE_SIZE | Size of the request queue for each client | 100 | Complete | REQUEST_QUEUE_SIZE |

Tests and Clients Options

| Option | Description | Default | Status | Compile Definitions (Pre-Processor) |
|---|---|---|---|---|
| Scoria_REQUIRE_CLIENTS | Build example clients located in clients directory | ON | Complete | None |
| Scoria_REQUIRE_TESTS | Build benchmark tests based on tests/test.c | ON | Complete | None |
| Scoria_REQUIRE_CALIBRATION_TESTS | Build calibration tests based on tests/calibration.c | OFF | Complete | None |
| Scoria_REQUIRE_TIMING | Build Scoria with internal timing + build tests to print internal results | OFF | Complete | Scoria_REQUIRE_TIMING |
| Scoria_SCALE_BW | Build Scoria and tests to account for indirection arrays when calculating bandwidth | OFF | Complete | SCALE_BW |
| Scoria_SINGLE_ALLOC | Build benchmark and calibration tests with single allocation policy | OFF | Complete | SINGLE_ALLOC |
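
To illustrate what "accounting for indirection arrays when calculating bandwidth" could mean for Scoria_SCALE_BW, here is a rough sketch. It is not taken from the Scoria sources: the helper names bytes_moved and bandwidth_gib_s are hypothetical, and the byte-count formula (one read and one write per double, plus one size_t index read per level of indirection when scaling is on) is an assumption made for illustration only.

```c
#include <stddef.h>

/* Bytes moved by one gather or scatter of n doubles.
 * Assumption (not from the Scoria sources): every element is read once
 * and written once, and when scale_bw is nonzero each level of
 * indirection adds one size_t index read per element. */
static double bytes_moved(size_t n, int levels, int scale_bw) {
  double bytes = 2.0 * (double)n * (double)sizeof(double);
  if (scale_bw) {
    bytes += (double)levels * (double)n * (double)sizeof(size_t);
  }
  return bytes;
}

/* Convert a byte count and an elapsed time in seconds to GiB/s. */
static double bandwidth_gib_s(double bytes, double seconds) {
  return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Under these assumptions, a test with 2 levels of indirection would report a higher bandwidth with scaling enabled than without it for the same elapsed time, because the index traffic is counted as well.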

Examples

Build Scoria with only bandwidth tests (no clients, no calibration tests and no AVX/SVE)

cmake -DScoria_REQUIRE_CLIENTS=OFF ..
make

The test and test_client executables should be in the tests directory, along with the scoria executable in the base build directory.

Build Scoria with both bandwidth and calibration tests (no clients and no AVX/SVE)

cmake -DScoria_REQUIRE_CALIBRATION_TESTS=ON -DScoria_REQUIRE_CLIENTS=OFF ..
make

The test, test_client, test_calibration, and test_calibration_client executables should be in the tests directory, along with the scoria executable in the base build directory.

Build Scoria with bandwidth and calibration tests and clients (no AVX/SVE)

cmake -DScoria_REQUIRE_CALIBRATION_TESTS=ON ..
make

The test, test_client, test_calibration, and test_calibration_client executables should be in the tests directory; the simple_client and spatter executables should be in the clients directory; and the scoria executable should be in the base build directory.

Build Scoria with bandwidth and calibration tests and clients with AVX intrinsics, internal timing, and bandwidth scaling enabled, along with the ability to manage 4 clients simultaneously

cmake -DScoria_REQUIRE_CALIBRATION_TESTS=ON -DScoria_REQUIRE_AVX=ON -DScoria_REQUIRE_TIMING=ON -DScoria_SCALE_BW=ON -DMAX_CLIENTS=4 ..
make

Build UME with AVX Extension and Scoria

cmake -DUSE_MPI=ON -DUSE_CALIPER=ON -DScoria_REQUIRE_AVX=ON ..
make

The ume_mpi and ume_serial executables should be in the clients/UME/src directory

Build UME + Scoria with Caliper Profiling and AVX

mkdir caliper_build
cd caliper_build
cmake -DUSE_CALIPER=ON -DUSE_MPI=ON -DScoria_REQUIRE_AVX=ON ..
make

The ume_mpi and ume_serial executables should be in the clients/UME/src directory

Build baseline (non-scoria) UME for Caliper profiling without AVX

cd clients/UME
mkdir caliper_build
cd caliper_build
cmake  -DUSE_CALIPER=ON -DUSE_MPI=ON ../
make

The ume_mpi and ume_serial executables should be in the src directory

For the build with AVX intrinsics, internal timing, and bandwidth scaling described earlier, the test, test_client, test_calibration, and test_calibration_client executables should be in the tests directory; the simple_client and spatter executables should be in the clients directory; and the scoria executable should be in the base build directory. The test_client and test_calibration_client executables, when run with the scoria controller, should now output both internal and external bandwidth measurements and timings.

Tests

Tests for 0, 1, and 2 levels of indirection are implemented. They come in the following flavors:

  • str uses straight access, meaning index a[i] = i for all levels of indirection (this is the only test available for 0 levels of indirection).
  • A or noA denotes whether aliases are included. If aliases are included, they are added before the shuffle stage (see below). For each index, a random number is drawn, and if it is below the alias fraction, this index is inserted at a random position in the indirection indices. This is done for all levels of indirection.
  • F or C denotes full or clustered shuffle and aliases. Full shuffle means the indices are shuffled across the entire range and aliases, if used, are inserted across the entire range. In clustered mode, shuffling and aliasing happen only within consecutive clusters of the given size. For example, with a cluster size S = 32, the first cluster covers indices 0 - 31: any aliases added to this cluster are indices from within it, and only these indices are shuffled amongst themselves. The next cluster covers indices 32 - 63 and is treated the same way.
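
The alias and shuffle stages described above can be sketched roughly as follows. This is an illustration, not the code in tests/test.c: the function names (fill_straight, add_aliases, shuffle_range, build_indices) are hypothetical, rand() stands in for whatever random source the tests use, and "inserting" an alias is modelled here as overwriting a random slot so that the array length stays fixed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Straight pattern: idx[i] = i. */
static void fill_straight(size_t *idx, size_t n) {
  for (size_t i = 0; i < n; ++i) idx[i] = i;
}

/* Alias stage for the range [lo, hi): for each index i, draw a random
 * number; if it is below alias_fraction, write i into a random slot of
 * the same range (assumption: aliases overwrite existing entries). */
static void add_aliases(size_t *idx, size_t lo, size_t hi,
                        double alias_fraction) {
  for (size_t i = lo; i < hi; ++i) {
    double r = (double)rand() / ((double)RAND_MAX + 1.0);
    if (r < alias_fraction) {
      size_t pos = lo + (size_t)rand() % (hi - lo);
      idx[pos] = i;
    }
  }
}

/* Fisher-Yates shuffle restricted to idx[lo, hi). */
static void shuffle_range(size_t *idx, size_t lo, size_t hi) {
  if (hi - lo < 2) return;
  for (size_t i = hi - 1; i > lo; --i) {
    size_t j = lo + (size_t)rand() % (i - lo + 1);
    size_t tmp = idx[i];
    idx[i] = idx[j];
    idx[j] = tmp;
  }
}

/* Build one indirection array of length n.
 * cluster == 0 selects full mode (one cluster spanning the whole range);
 * otherwise each block of `cluster` indices is aliased and shuffled
 * independently. alias_fraction == 0.0 corresponds to the noA variants. */
static void build_indices(size_t *idx, size_t n, size_t cluster,
                          double alias_fraction) {
  fill_straight(idx, n);
  size_t step = (cluster == 0 || cluster > n) ? n : cluster;
  for (size_t lo = 0; lo < n; lo += step) {
    size_t hi = (lo + step < n) ? lo + step : n;
    if (alias_fraction > 0.0) add_aliases(idx, lo, hi, alias_fraction);
    shuffle_range(idx, lo, hi);
  }
}
```

In this sketch, build_indices(idx, n, 32, 0.1) would correspond to a clustered test with aliases and S = 32, while build_indices(idx, n, 0, 0.0) would correspond to a full shuffle without aliases. In either clustered case every entry stays inside its own cluster, which is the property the clustered tests rely on.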

Under the tests directory in the build directory, there are four executables. Each is run by passing the number of doubles to test on, for example: ./test 8388608

  • test runs the test suite without using the client and controller infrastructure; it just tests the kernels directly
  • test_client runs the tests as a client and communicates with the controller; a controller must thus be running
  • test_calibrate performs a STREAM-like benchmark for baselining and runs the 0-level indirection test without using the client and controller infrastructure; it just tests the kernels directly. Requires OpenMP for the STREAM-like benchmark.
  • test_calibrate_client performs a STREAM-like benchmark for baselining and runs the 0-level indirection test as a client and communicates with the controller; a controller must thus be running. Currently has experimental code to re-map pages to particular NUMA nodes. Requires OpenMP for the STREAM-like benchmark.

Clients

To add your own clients, use clients/simple/simple_client.c as a starting point. At a minimum, you will need to initialize and clean up the client as follows:

#include "scoria.h"

int main(int argc, char **argv) {
  struct client client;
  client.chatty = 0;

  scoria_init(&client);

  // Your code here  

  scoria_cleanup(&client);
  return 0;
}

Allocate usable shared memory between the client and Scoria with shm_malloc(size_t s)

double *A = shm_malloc(1024 * sizeof(double));

The following commands can be used to perform gathers (reads) or scatters (writes) with 0, 1, or 2 levels of indirection:

void scoria_write(struct client *client, void *buffer, const size_t N, const void *input, const size_t *ind1, const size_t *ind2, size_t num_threads, i_type intrinsics, struct request *req)

void scoria_read(struct client *client, const void *buffer, const size_t N, void *output, const size_t *ind1, const size_t *ind2, size_t num_threads, i_type intrinsics, struct request *req)

void scoria_quit(struct client *client, struct request *req)

The available intrinsics are: NONE, AVX, and SVE
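
As a plain-C illustration of what 0, 1, and 2 levels of indirection mean for a gather and a scatter, consider the reference loops below. They do not use Scoria and are not its implementation: the names resolve, ref_gather, and ref_scatter are hypothetical, and two points are assumptions made for this sketch, namely that a NULL index array means the level is unused, and that with two levels the lookup order is ind2[ind1[i]].

```c
#include <stddef.h>

/* Resolve element i through up to two index arrays.
 * Assumptions: a NULL array means "level not used", and with two
 * levels the lookup order is ind2[ind1[i]]. */
static size_t resolve(size_t i, const size_t *ind1, const size_t *ind2) {
  if (ind1 != NULL) i = ind1[i];
  if (ind2 != NULL) i = ind2[i];
  return i;
}

/* Gather (read): output[i] = buffer[resolve(i)] for i in [0, n). */
static void ref_gather(const double *buffer, size_t n, double *output,
                       const size_t *ind1, const size_t *ind2) {
  for (size_t i = 0; i < n; ++i) {
    output[i] = buffer[resolve(i, ind1, ind2)];
  }
}

/* Scatter (write): buffer[resolve(i)] = input[i] for i in [0, n). */
static void ref_scatter(double *buffer, size_t n, const double *input,
                        const size_t *ind1, const size_t *ind2) {
  for (size_t i = 0; i < n; ++i) {
    buffer[resolve(i, ind1, ind2)] = input[i];
  }
}
```

With both index arrays NULL the loops reduce to a straight copy; passing only ind1 gives one level of indirection; passing both gives two. Loops like these can serve as a reference when checking the results a client gets back from the controller.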

Read and Write requests are handled asynchronously by Scoria. They can be completed using:

void wait_request(struct client *client, struct request *req)

| Client | Description | Directory | Status |
|---|---|---|---|
| Simple | Minimal client that demonstrates read/write/quit using shared memory | client/simple | Complete |
| Spatter | Microbenchmark for timing Gather/Scatter kernels (Spatter) | client/spatter | Complete |
| Minimal Spatter | Minimal Spatter client that removes argtable and other dependencies | client/minimal_spatter | In Progress |
| Ume | Flag Proxy which attempts to capture memory access patterns, kernels, and mesh structure (Ume) | client/ume | Complete |
| EAPPAT | Memory access and iterations patterns from the EAP code base with the physics removed (EAP Patterns) | client/eappat | Coming Soon |

Usage

Examples

Terminal window 1

./scoria

Terminal window 2

./tests/test_client 1048576

On nodes with multiple CPU sockets, bandwidth can be drastically reduced if the client and controller processes are bound to different NUMA nodes. To explicitly bind the processes to the same socket, use the following:

Terminal window 1

hwloc-bind node:0 ./scoria

Terminal window 2

hwloc-bind node:0 ./tests/test_client 1048576

Scripts

Note: To use the scripts, Scoria must have been built without internal timing, i.e. -DScoria_REQUIRE_TIMING=OFF

scripts/simple_test_bw.py contains a script to launch both Scoria and the test client. It is configurable with the following options:

| Short Option | Long Option | Description | Default |
|---|---|---|---|
| -l | --logfile | Logfile name | client.log |
| -p | --plotfile | Plot file names (see plot_test_bw.py) | bw.png |
| -n | --size | Number of doubles to pass to test_client | 1048576 |
| -s | --bindscoria | hwloc-bind options for Scoria | None |
| -b | --bindclient | hwloc-bind options for test_client | None |

The output will be a log file with the bandwidth data and bar charts of the bandwidth for each test at each thread count. If AVX or SVE is enabled, those results will be saved to an individual figure with the appropriate name.

scripts/scoria-vs-ume.sh contains a script to build Ume + Scoria with AVX and Caliper enabled, and to build a standalone Ume executable with Caliper. It then runs both with an Ume input file of your choosing and with the specified number of ranks, and outputs profiling data in the form of a text file or a JSON file that can be read by Hatchet. It is configurable with the following options:

| Short Option | Description | Default |
|---|---|---|
| -c | (Optional) CALI_CONFIG setting | runtime-report(output=report.log) |
| -f | Absolute path to Input Deck for Ume | None |
| -n | (Optional) Number of ranks to use to launch MPI run | 1 |
| -s | (Optional) Scoria root directory | pwd |
| -p | (Optional) List of PAPI Counters to collect | None |

Example

cd scoria
python3 scripts/simple_test_bw.py -l output.log -p scoria.png -n 8388608 -s node:0 -b node:0
bash scripts/scoria-vs-ume.sh -c "hatchet-region-profile" -n <num-ranks> -f <absolute-path-to-input-deck>
bash scripts/scoria-vs-ume.sh  -n <num-ranks> -f <absolute-path-to-input-deck>  -p "PAPI_DP_OPS,PAPI_TOT_CYC,PAPI_TOT_INS,PAPI_LD_INS,PAPI_SR_INS,PAPI_BR_INS,PAPI_LST_INS"

License

Triad National Security, LLC (Triad) owns the copyright to Scoria. The license is BSD-ish with a "modifications must be indicated" clause. See LICENSE for the full text.

Authors and acknowledgment