SmartSim makes it easier to use common Machine Learning (ML) libraries, like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and applications.
SmartSim provides an API to connect HPC workloads, particularly (MPI + X) simulations, to the Orchestrator, an in-memory database built on Redis.
Applications integrated with the SmartRedis clients, written in Fortran, C, C++ and Python, can stream tensors and datasets to and from the Orchestrator. The distributed Client-Server paradigm allows data to be seamlessly exchanged between applications at runtime.
In addition to exchanging data between languages, any of the SmartRedis clients can remotely execute Machine Learning models and TorchScript code on data stored in the Orchestrator, regardless of which language the data originated from.
SmartSim supports the following ML libraries.
Library | Supported Version |
---|---|
PyTorch | 1.7.1 |
TensorFlow\Keras | 2.4.2 |
ONNX | 1.7.0 |
A number of other libraries are supported through ONNX, like SciKit-Learn and XGBoost.
SmartSim is made up of two parts:
- SmartSim Infrastructure Library (This repository)
- SmartRedis
The two library components are designed to work together, but can also be used independently.
Table of Contents
- SmartSim
- SmartSim Infrastructure Library
- Infrastructure Library Applications
- SmartRedis
- SmartSim + SmartRedis
- Publications
- Cite
The Infrastructure Library (IL), the smartsim Python package, facilitates the launch of Machine Learning and simulation workflows. The Python interface of the IL creates, configures, launches and monitors applications.
The Experiment object is the main interface of SmartSim. Through the Experiment, users can create references to applications called Models.
Below is a simple example of a workflow that uses the IL to launch a hello world program using the local launcher, which is designed for laptops and single nodes.
from smartsim import Experiment
from smartsim.settings import RunSettings
exp = Experiment("simple", launcher="local")
settings = RunSettings("echo", exe_args="Hello World")
model = exp.create_model("hello_world", settings)
exp.start(model, block=True)
print(exp.get_status(model))
RunSettings define how a model is launched. There are many types of RunSettings supported by SmartSim.
RunSettings
MpirunSettings
SrunSettings
AprunSettings
JsrunSettings
For example, MpirunSettings can be used to launch MPI programs with OpenMPI.
from smartsim import Experiment
from smartsim.settings import MpirunSettings
exp = Experiment("hello_world", launcher="local")
mpi = MpirunSettings(exe="echo", exe_args="Hello World!")
mpi.set_tasks(4)
mpi_model = exp.create_model("hello_world", mpi)
exp.start(mpi_model, block=True)
print(exp.get_status(mpi_model))
SmartSim integrates with common HPC schedulers, providing batch and interactive launch capabilities for all applications.
- Slurm
- LSF
- PBSPro
- Cobalt
- Local (for laptops/single node, no batch)
The following launches the same hello_world
model in an interactive allocation
using the Slurm launcher. Jupyter/IPython notebooks, and scripts
# get interactive allocation
salloc -N 1 -n 32 --exclusive -t 00:10:00
# hello_world.py
from smartsim import Experiment
from smartsim.settings import SrunSettings
exp = Experiment("hello_world_exp", launcher="slurm")
srun = SrunSettings(exe="echo", exe_args="Hello World!")
srun.set_nodes(1)
srun.set_tasks(32)
model = exp.create_model("hello_world", srun)
exp.start(model, block=True, summary=True)
print(exp.get_status(model))
# in interactive terminal
python hello_world.py
This script could also be launched in a batch file instead of an interactive terminal.
#!/bin/bash
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=00:10:00
python /path/to/script.py
# on Slurm system
sbatch run_hello_world.sh
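When the same batch file is needed for several node counts or time limits, it can be convenient to generate it from Python. The following is a minimal sketch and not part of SmartSim: make_sbatch_script is a hypothetical helper that only reproduces the directives shown in the batch file above.

```python
def make_sbatch_script(script_path, nodes=1, tasks_per_node=32,
                       time="00:10:00", exclusive=True):
    """Return the text of a Slurm batch script that runs a Python driver.

    Hypothetical helper for illustration; it is not a SmartSim function.
    """
    lines = ["#!/bin/bash"]
    if exclusive:
        lines.append("#SBATCH --exclusive")
    lines.append(f"#SBATCH --nodes={nodes}")
    lines.append(f"#SBATCH --ntasks-per-node={tasks_per_node}")
    lines.append(f"#SBATCH --time={time}")
    lines.append(f"python {script_path}")
    return "\n".join(lines) + "\n"


# reproduce the batch file shown above
script = make_sbatch_script("/path/to/script.py")
print(script)
```

The returned string can be written to run_hello_world.sh and submitted with sbatch as before.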
SmartSim can also launch workloads in a batch directly from Python, without the need for a batch script. Users can launch groups of Model instances in an Ensemble.
The following launches 4 replicas of the same hello_world model.
# hello_ensemble.py
from smartsim import Experiment
from smartsim.settings import SrunSettings, SbatchSettings
exp = Experiment("hello_world_batch", launcher="slurm")
# define resources for all ensemble members
sbatch = SbatchSettings(nodes=4, time="00:10:00", account="12345-Cray")
sbatch.set_partition("premium")
# define how each member should run
srun = SrunSettings(exe="echo", exe_args="Hello World!")
srun.set_nodes(1)
srun.set_tasks(32)
ensemble = exp.create_ensemble("hello_world", batch_settings=sbatch,
run_settings=srun, replicas=4)
exp.start(ensemble, block=True, summary=True)
print(exp.get_status(ensemble))
# on Slurm system
python hello_ensemble.py
Here is the same example, but for PBS using AprunSettings for running with aprun.
MpirunSettings could also be used in this example, as OpenMPI is supported by all the launchers within SmartSim.
# hello_ensemble_pbs.py
from smartsim import Experiment
from smartsim.settings import AprunSettings, QsubBatchSettings
exp = Experiment("hello_world_batch", launcher="pbs")
# define resources for all ensemble members
qsub = QsubBatchSettings(nodes=4, time="00:10:00",
account="12345-Cray", queue="cl40")
# define how each member should run
aprun = AprunSettings(exe="echo", exe_args="Hello World!")
aprun.set_tasks(32)
ensemble = exp.create_ensemble("hello_world", batch_settings=qsub,
run_settings=aprun, replicas=4)
exp.start(ensemble, block=True, summary=True)
print(exp.get_status(ensemble))
# on PBS system
python hello_ensemble_pbs.py
- Orchestrator - In-memory data store and Machine Learning Inference (Redis + RedisAI)
The Orchestrator
is an in-memory database that utilizes Redis and RedisAI to provide
a distributed database and access to ML runtimes from Fortran, C, C++ and Python.
SmartSim provides classes that make it simple to launch the database in many configurations and optionally form a distributed database cluster. The examples below show how to launch the database. Later in this document we show how to use the database to perform ML inference and processing.
The following script launches a single database using the local launcher.
# run_db_local.py
from smartsim import Experiment
from smartsim.database import Orchestrator
exp = Experiment("local-db", launcher="local")
db = Orchestrator(port=6780)
# by default, SmartSim never blocks execution after the database is launched.
exp.start(db)
# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients
# stop the database
exp.stop(db)
The Orchestrator
, like Ensemble
instances, can be launched locally, in interactive
allocations, or in a batch.
The Orchestrator is broken into several classes to ease submission on HPC systems.
The following example launches a distributed (3 node) database cluster on a Slurm system from an interactive allocation terminal.
# get interactive allocation
salloc -N 3 --ntasks-per-node=1 --exclusive -t 00:10:00
# run_db_slurm.py
from smartsim import Experiment
from smartsim.database import SlurmOrchestrator
exp = Experiment("db-on-slurm", launcher="slurm")
db_cluster = SlurmOrchestrator(db_nodes=3, db_port=6780, batch=False)
exp.start(db_cluster)
print(f"Orchestrator launched on nodes: {db_cluster.hosts}")
# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients
exp.stop(db_cluster)
# in interactive terminal
python run_db_slurm.py
Here is the same example on a PBS system:
# get interactive allocation
qsub -l select=3:ppn=1 -l walltime=00:10:00 -q cl40 -I
# run_db_pbs.py
from smartsim import Experiment
from smartsim.database import PBSOrchestrator
exp = Experiment("db-on-pbs", launcher="pbs")
db_cluster = PBSOrchestrator(db_nodes=3, db_port=6780, batch=False)
exp.start(db_cluster)
print(f"Orchestrator launched on nodes: {db_cluster.hosts}")
# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients
exp.stop(db_cluster)
# in interactive terminal
python run_db_pbs.py
The Orchestrator
can also be launched in a batch without the need for an interactive allocation.
SmartSim will create the batch file, submit it to the batch system, and then wait for the database
to be launched. Users can hit CTRL-C to cancel the launch if needed.
# run_db_pbs_batch.py
from smartsim import Experiment
from smartsim.database import PBSOrchestrator
exp = Experiment("db-on-pbs", launcher="pbs")
db_cluster = PBSOrchestrator(db_nodes=3, db_port=6780, batch=True,
time="00:10:00", account="12345-Cray", queue="cl40")
exp.start(db_cluster)
print(f"Orchestrator launched on nodes: {db_cluster.hosts}")
# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients
exp.stop(db_cluster)
# on PBS system
python run_db_pbs_batch.py
The SmartSim IL Clients (SmartRedis) are implementations of Redis clients that implement the RedisAI API with additions specific to scientific workflows.
SmartRedis clients are available in Fortran, C, C++, and Python. Users can seamlessly pull and push data from the Orchestrator from different languages.
Tensors are the fundamental data structure for the SmartRedis clients. The Clients use the native array format of the language. For example, in Python, a tensor is a NumPy array. The C++/C client accepts nested and contiguous arrays.
When stored in the database, all tensors are stored in the same format. Hence, any language can receive a tensor from the database no matter what supported language the array was sent from. This enables applications in different languages to communicate numerical data with each other at runtime (coupling).
For more information on the tensor data structure, see the documentation.
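To illustrate why a single stored format makes cross-language exchange possible, here is a minimal sketch of a language-neutral encoding: a little-endian header holding the shape, followed by row-major float32 values. The layout and the helper names pack_tensor and unpack_tensor are assumptions made for illustration; this is not the format the Orchestrator actually uses.

```python
import struct


def pack_tensor(values, shape):
    """Serialize row-major float32 values behind a small shape header."""
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match the number of values")
    # header: number of dimensions, then each dimension (unsigned 32-bit)
    header = struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    body = struct.pack(f"<{count}f", *values)
    return header + body


def unpack_tensor(blob):
    """Recover the flat values and the shape from a packed tensor."""
    (ndim,) = struct.unpack_from("<I", blob, 0)
    shape = struct.unpack_from(f"<{ndim}I", blob, 4)
    count = 1
    for dim in shape:
        count *= dim
    values = struct.unpack_from(f"<{count}f", blob, 4 + 4 * ndim)
    return list(values), tuple(shape)


# a 2 x 3 tensor written by one "application" and read back by another
blob = pack_tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3))
values, shape = unpack_tensor(blob)
print(values, shape)
```

Because the byte layout is fixed and does not depend on the sender, a reader written in any language that can parse the header can rebuild the array in its own native format.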
Datasets are collections of Tensors and associated metadata. The Dataset class is a user space object that can be created, added to, sent to, and retrieved from the Orchestrator database.
For an example of how to use the Dataset class, see the Online Analysis example.
For more information on the API, see the API documentation.
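As a rough mental model of what a Dataset groups together, the sketch below uses a hypothetical SketchDataset class and a plain dictionary in place of the database. Neither is part of SmartRedis; the method names are chosen only to resemble the create, add, send and retrieve steps described above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDataset:
    """Hypothetical stand-in: named tensors plus metadata under one name."""
    name: str
    tensors: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def add_tensor(self, key, values):
        self.tensors[key] = values

    def add_meta_scalar(self, key, value):
        # in this sketch, repeated calls append to the same metadata field
        self.metadata.setdefault(key, []).append(value)


# a plain dict plays the role of the database in this sketch
database = {}

dataset = SketchDataset("data_0")
dataset.add_tensor("ux", [[0.1, 0.2], [0.3, 0.4]])
dataset.add_tensor("uy", [[0.0, 0.0], [0.1, 0.1]])
dataset.add_meta_scalar("time_step", 0)

database[dataset.name] = dataset   # "send" the whole dataset under one key
retrieved = database["data_0"]     # "retrieve" it by name
print(sorted(retrieved.tensors), retrieved.metadata)
```

The point of the grouping is that related tensors and their metadata travel together under a single name instead of being tracked as separate keys.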
Even though the clients rely on the Orchestrator database to be running, it can be helpful to see how the API is used across different languages even without the infrastructure code. The following examples show simple client usage.
For more information on the SmartRedis clients, see the API documentation and tutorials.
Please note that these are client examples; they will not run if there is no database to connect to.
Training code and Model construction are not shown here, but the example below shows how to take a PyTorch model, send it to the database, and execute it on data stored within the database.
Notably, the GPU argument is used to ensure that execution of the model takes place on a GPU if one is available to the database.
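import io
import torch
from smartredis import Client
net = create_mnist_cnn() # placeholder: returns trained PyTorch nn.Module
# serialize the model as TorchScript, since set_model takes the serialized bytes
example_input = torch.rand(20, 1, 28, 28)
traced = torch.jit.trace(net, example_input)
model_buffer = io.BytesIO()
torch.jit.save(traced, model_buffer)
client = Client(address="127.0.0.1:6780", cluster=False)
client.put_tensor("input", example_input.numpy())
# put the PyTorch CNN in the database in GPU memory
client.set_model("cnn", model_buffer.getvalue(), "TORCH", device="GPU")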
# execute the model, supports a variable number of inputs and outputs
client.run_model("cnn", inputs=["input"], outputs=["output"])
# get the output
output = client.get_tensor("output")
print(f"Prediction: {output}")
One common pattern is to use SmartSim to spin up the Orchestrator database and then use the Python client to set the model in the database. Once set, an application that uses the C, C++, or Fortran clients will call the model that was set.
This example shows the necessary code an application would need to include to execute a model (with any ML backend) that had been stored prior to application launch by the Python client.
#include "client.h"
// dummy tensor for brevity
// Initialize a vector that will hold input image tensor
size_t n_values = 1*1*28*28;
std::vector<float> img(n_values, 0);
// Declare keys that we will use in forthcoming client commands
std::string model_name = "cnn"; // from previous example
std::string in_key = "mnist_input";
std::string out_key = "mnist_output";
// Initialize a Client object
SmartRedis::Client client(false);
// Put the image tensor on the database
client.put_tensor(in_key, img.data(), {1,1,28,28},
SmartRedis::TensorType::flt,
SmartRedis::MemoryLayout::contiguous);
// Run model already in the database
client.run_model(model_name, {in_key}, {out_key});
// Get the result of the model
std::vector<float> result(1*10);
client.unpack_tensor(out_key, result.data(), {10},
SmartRedis::TensorType::flt,
SmartRedis::MemoryLayout::contiguous);
You can also load a model from file and put it in the database before you execute it. This example shows how this is done in Fortran.
program run_mnist_example
use smartredis_client, only : client_type
implicit none
character(len=*), parameter :: model_key = "mnist_model"
character(len=*), parameter :: model_file = "../../cpp/mnist_data/mnist_cnn.pt"
type(client_type) :: client
call client%initialize(.false.)
! Load pre-trained model into the Orchestrator database
call client%set_model_from_file(model_key, model_file, "TORCH", "GPU")
call run_mnist(client, model_key)
contains
subroutine run_mnist( client, model_name )
type(client_type), intent(in) :: client
character(len=*), intent(in) :: model_name
integer, parameter :: mnist_dim1 = 28
integer, parameter :: mnist_dim2 = 28
integer, parameter :: result_dim1 = 10
real, dimension(1,1,mnist_dim1,mnist_dim2) :: array
real, dimension(1,result_dim1) :: result
character(len=255) :: in_key
character(len=255) :: out_key
character(len=255), dimension(1) :: inputs
character(len=255), dimension(1) :: outputs
! Construct the keys used for specifying inputs and outputs
in_key = "mnist_input"
out_key = "mnist_output"
! Generate some fake data for inference
call random_number(array)
call client%put_tensor(in_key, array, shape(array))
inputs(1) = in_key
outputs(1) = out_key
call client%run_model(model_name, inputs, outputs)
result(:,:) = 0.
call client%unpack_tensor(out_key, result, shape(result))
end subroutine run_mnist
end program run_mnist_example
SmartSim and SmartRedis were designed to work together. When launched through SmartSim, applications using the SmartRedis clients are directly connected to any Orchestrator launched in the same Experiment.
In this way, a SmartSim Experiment becomes a driver for coupled ML and Simulation workflows. The following are simple examples of how to use SmartSim and SmartRedis together.
Using SmartSim, HPC applications can be monitored in real time by streaming data from the application to the database. SmartRedis clients can retrieve the data, process and analyze it, and store the results in the database.
The following is an example of how a user could monitor and analyze a simulation. The example here uses the Python client, but SmartRedis clients are available in C++, C, and Fortran as well and implement the same API.
The example will produce the visualization below while the simulation is running.
lattice.mp4
Using a Lattice Boltzmann Simulation,
this example demonstrates how to use the SmartRedis Dataset
API to stream
data to the Orchestrator deployed by SmartSim.
The following code shows the pieces of the simulation that are needed to transmit the data used to plot timesteps of the simulation.
# fv_sim.py
from smartredis import Client, Dataset
import numpy as np
# initialization code omitted
# save cylinder location to database
cylinder = (X - x_res/4)**2 + (Y - y_res/2)**2 < (y_res/4)**2 # bool array
client.put_tensor("cylinder", cylinder.astype(np.int8))
for time_step in range(steps): # simulation loop
    for i, cx, cy in zip(idxs, cxs, cys):
        F[:,:,i] = np.roll(F[:,:,i], cx, axis=1)
        F[:,:,i] = np.roll(F[:,:,i], cy, axis=0)
    bndryF = F[cylinder,:]
    bndryF = bndryF[:,[0,5,6,7,8,1,2,3,4]]
    rho = np.sum(F, 2)
    ux = np.sum(F * cxs, 2) / rho
    uy = np.sum(F * cys, 2) / rho
    Feq = np.zeros(F.shape)
    for i, cx, cy, w in zip(idxs, cxs, cys, weights):
        Feq[:,:,i] = rho * w * ( 1 + 3*(cx*ux+cy*uy) + 9*(cx*ux+cy*uy)**2/2 - 3*(ux**2+uy**2)/2 )
    F += -(1.0/tau) * (F - Feq)
    F[cylinder,:] = bndryF
    # Create a SmartRedis dataset with vorticity data
    dataset = Dataset(f"data_{str(time_step)}")
    dataset.add_tensor("ux", ux)
    dataset.add_tensor("uy", uy)
    # Put Dataset in db at key "data_{time_step}"
    client.put_dataset(dataset)
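The driver that launches the database and the simulation (non-blocking) looks like:
# driver.py
from smartredis import Client
from smartsim import Experiment
from smartsim.database import Orchestrator
from smartsim.settings import RunSettings
time_steps, seed = 3000, 42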
exp = Experiment("finite_volume_simulation", launcher="local")
db = Orchestrator(port=6780)
settings = RunSettings("python", exe_args=["fv_sim.py",
f"--seed={seed}",
f"--steps={time_steps}"])
model = exp.create_model("fv_simulation", settings)
model.attach_generator_files(to_copy="fv_sim.py")
exp.generate(db, model, overwrite=True)
exp.start(db)
client = Client(address="127.0.0.1:6780", cluster=False)
# start simulation (non-blocking)
exp.start(model, block=False, summary=True)
# poll until simulation starts and then retrieve data
client.poll_key("cylinder", 200, 100)
cylinder = client.get_tensor("cylinder").astype(bool)
for i in range(0, time_steps):
    client.poll_key(f"data_{str(i)}", 10, 1000)
    dataset = client.get_dataset(f"data_{str(i)}")
    ux, uy = dataset.get_tensor("ux"), dataset.get_tensor("uy")
    # analysis/plotting code omitted
exp.stop(db)
More details about online analysis with SmartSim and the full code examples can be found in the SmartSim documentation.
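The driver relies on poll_key to wait for data that the simulation has not written yet. As a rough sketch of that behaviour, the stand-in below checks a plain dictionary for a key at a fixed interval and gives up after a number of tries. The dictionary store and this poll_key function are illustrative assumptions that mirror the argument order used in the driver (key, poll frequency in milliseconds, number of tries); they are not the SmartRedis implementation.

```python
import threading
import time


def poll_key(store, key, poll_frequency_ms, num_tries):
    """Return True once key is in store, checking at most num_tries times."""
    for _ in range(num_tries):
        if key in store:
            return True
        time.sleep(poll_frequency_ms / 1000.0)
    return False


# a plain dict stands in for the database
store = {}

# a background "simulation" publishes its first result after 50 ms
producer = threading.Timer(0.05, store.__setitem__, args=("data_0", [0.0, 1.0]))
producer.start()

# the "driver" waits up to 100 * 10 ms for the key to appear
found = poll_key(store, "data_0", poll_frequency_ms=10, num_tries=100)
# a key that is never written exhausts its tries and reports failure
missing = poll_key(store, "never_written", poll_frequency_ms=1, num_tries=5)
producer.join()
print(found, missing)
```

Bounding the number of tries keeps the consumer from waiting forever if the producer fails before writing its data.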
Each of the SmartRedis clients can be used to remotely execute TorchScript code on data stored within the database. The scripts/functions are executed in the Torch runtime linked into the database.
Any of the functions available in the TorchScript builtins can be saved as "scripts" or "functions" in the database and used directly by any of the SmartRedis Clients.
For example, the following code sends the built-in Singular Value Decomposition to the database and executes it on a dummy tensor.
import numpy as np
from smartredis import Client
# don't even need to import torch
def calc_svd(input_tensor):
    return input_tensor.svd()
# connect a client to the database
client = Client(address="127.0.0.1:6780", cluster=False)
# get dummy data
tensor = np.random.randint(0, 100, size=(5, 3, 2)).astype(np.float32)
client.put_tensor("input", tensor)
client.set_function("svd", calc_svd)
client.run_script("svd", "calc_svd", "input", ["U", "S", "V"])
# results are not retrieved immediately in case they need
# to be fed to another function/model
U = client.get_tensor("U")
S = client.get_tensor("S")
V = client.get_tensor("V")
print(f"U: {U}, S: {S}, V: {V}")
The processing capabilities make it simple to form computational pipelines of functions, scripts, and models.
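The pipeline idea can be sketched without a database: each stage reads its inputs from named keys and writes its outputs to new keys, so a later stage can consume what an earlier one produced. In the sketch below, the dictionary store, the run_stage helper, and the two stage functions are hypothetical stand-ins for the database, the client calls that run scripts or models, and the stored functions.

```python
def run_stage(store, fn, inputs, outputs):
    """Apply fn to the values stored at inputs and write results to outputs."""
    results = fn(*(store[key] for key in inputs))
    if len(outputs) == 1:
        results = (results,)  # treat a single return value as one output
    for key, value in zip(outputs, results):
        store[key] = value


def normalize(values):
    """Scale values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def min_max(values):
    """Return the smallest and largest value."""
    return min(values), max(values)


# a plain dict stands in for the database
store = {"raw": [1.0, 3.0, 4.0]}

# stage 1 writes "normalized"; stage 2 reads it and writes two outputs
run_stage(store, normalize, ["raw"], ["normalized"])
run_stage(store, min_max, ["normalized"], ["low", "high"])
print(store["low"], store["high"])
```

Intermediate results stay in the store between stages, which matches the earlier remark that results are not retrieved immediately in case they need to be fed to another function or model.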
See the full TorchScript Language Reference documentation for more information on available methods, functions, and how to create your own.
SmartSim supports the following frameworks for querying Machine Learning models from C, C++, Fortran and Python with the SmartRedis Clients:
Library | Supported Version |
---|---|
PyTorch | 1.7.1 |
TensorFlow\Keras | 2.4.2 |
ONNX | 1.7.0 |
Note, it's important to remember that SmartSim utilizes a client-server model. To run experiments that utilize the above frameworks, you must first start the Orchestrator database with SmartSim.
The example below shows how to spin up a database with SmartSim and invoke a PyTorch CNN model using the SmartRedis clients.
# simple_torch_inference.py
import io
import torch
import torch.nn as nn
from smartredis import Client
from smartsim import Experiment
from smartsim.database import Orchestrator
exp = Experiment("simple-online-inference", launcher="local")
db = Orchestrator(port=6780)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(1, 1, 3)
    def forward(self, x):
        return self.conv(x)
torch_model = Net()
example_forward_input = torch.rand(1, 1, 3, 3)
module = torch.jit.trace(torch_model, example_forward_input)
model_buffer = io.BytesIO()
torch.jit.save(module, model_buffer)
exp.start(db, summary=True)
address = db.get_address()[0]
client = Client(address=address, cluster=False)
client.put_tensor("input", example_forward_input.numpy())
client.set_model("cnn", model_buffer.getvalue(), "TORCH", device="CPU")
client.run_model("cnn", inputs=["input"], outputs=["output"])
output = client.get_tensor("output")
print(f"Prediction: {output}")
exp.stop(db)
To run:
python simple_torch_inference.py
For more examples of how to use SmartSim and SmartRedis together to perform online inference, please see the tutorials section of the SmartSim documentation.
The following are public presentations or publications using SmartSim:
- Collaboration with NCAR - CGD Seminar
- SmartSim: Using Machine Learning in HPC Simulations
- SmartSim: Online Analytics and Machine Learning for HPC Simulations
- PyTorch Ecosystem Day Poster
Please use the following citation when referencing SmartSim, SmartRedis, or any SmartSim related work.
Partee et al., “Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling,” arXiv:2104.09355, Apr. 2021, [Online]. Available: http://arxiv.org/abs/2104.09355.
```latex
@misc{partee2021using,
title={Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling},
author={Sam Partee and Matthew Ellis and Alessandro Rigazzi and Scott Bachman and Gustavo Marques and Andrew Shao and Benjamin Robbins},
year={2021},
eprint={2104.09355},
archivePrefix={arXiv},
primaryClass={cs.CE}
}
```