AWS Plugin for Slurm
Note from September 11, 2020: We've redeveloped the Slurm plugin for AWS. The new version is available in the plugin-v2 branch. Major changes include: support for EC2 Fleet capabilities such as Spot Instances and instance type diversification, decoupling node names from instance hostnames or IP addresses, and better error handling when a node fails to respond during launch. You can use the following command to clone this branch locally.
git clone -b plugin-v2 https://github.com/aws-samples/aws-plugin-for-slurm.git
A sample integration of AWS services with Slurm
License Summary
This sample code is made available under a modified MIT license. See the LICENSE file.
Requirements
You will need an AWS account with S3 read/write permissions, as well as the ability to execute CloudFormation templates. The CloudFormation template will provision a landing zone with one public subnet and three private subnets; each private subnet routes outbound traffic through a NAT Gateway in the public subnet. Permissions to create this network topology will be needed.
You can optionally add an EFS endpoint so that all ephemeral SLURM compute nodes and the headnode share a common namespace.
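As a sketch (assuming an EFS filesystem has already been created, and using a placeholder filesystem ID, region, and mount point), the shared filesystem could be mounted on the headnode and compute nodes with the stock NFS client:

sudo mkdir -p /efs
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-west-2.amazonaws.com:/ /efs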
Instructions
- Register your AWS account with the CentOS 7 (x86_64) - with Updates HVM marketplace subscription.
- Clone the GitHub repository and sync the contents into an S3 bucket, which will be used later to stand up the cluster.
- Download the SLURM source from SchedMD here and copy it into the S3 bucket created earlier (example commands for these two steps are shown below).
- Edit the slurm_headnode_cloudformation.yml file with the version of the SLURM source being used:
SlurmVersion:
  Description: Select SLURM version to install
  Type: String
  Default: 17.11.8
  AllowedValues:
    - 17.11.8
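For reference, the clone, sync, and download steps above might look like the following on a machine with the AWS CLI configured. This is only a sketch: the bucket name is a placeholder, the download URL follows SchedMD's usual naming scheme, and the exact bucket layout expected by the template may differ.

# Clone the repository and sync its contents into an S3 bucket (placeholder name)
git clone https://github.com/aws-samples/aws-plugin-for-slurm.git
aws s3 sync aws-plugin-for-slurm/ s3://my-slurm-bucket/

# Download the SLURM source and copy it into the same bucket
curl -O https://download.schedmd.com/slurm/slurm-17.11.8.tar.bz2
aws s3 cp slurm-17.11.8.tar.bz2 s3://my-slurm-bucket/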
- Open the AWS CloudFormation console and upload slurm_headnode_cloudformation.yml under CloudFormation -> Create Stack (an equivalent CLI invocation is sketched below).
The CloudFormation stack will create the public subnet, the three private subnets, and a single EC2 instance as the SLURM headnode. The SLURM source package you uploaded earlier will be retrieved and extracted, and the SLURM stack will be installed. An NFS server will be set up and used as a common namespace for the SLURM configuration.
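If you prefer the command line, a roughly equivalent stack creation might look like this. This is a sketch only: the stack name is a placeholder, any required template parameters (for example the S3 bucket or key pair) are omitted, and --capabilities CAPABILITY_IAM is passed in case the template creates IAM resources.

aws cloudformation create-stack \
    --stack-name slurm-headnode \
    --template-body file://slurm_headnode_cloudformation.yml \
    --capabilities CAPABILITY_IAM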
- The elastic compute portion of slurm.conf can be found at /nfs/slurm/etc/slurm.conf:
SuspendTime=60
ResumeTimeout=250
TreeWidth=60000
SuspendExcNodes=ip-10-0-0-251
SuspendProgram=/nfs/slurm/bin/slurm-aws-shutdown.sh
ResumeProgram=/nfs/slurm/bin/slurm-aws-startup.sh
ResumeRate=0
SuspendRate=0
You will find explanations of these parameters on the SLURM Elastic Computing - SchedMD page.
- Example of running the SLURM ephemeral cluster: in the initial state, sinfo shows that no nodes are currently available. Once the test.sbatch file is submitted, 2 nodes will be stood up (launched by the ResumeProgram), added to the cluster, and made ready for work (see the sketch below).
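A minimal batch file along these lines would exercise the elastic nodes; the actual test.sbatch shipped with this sample may differ.

#!/bin/bash
#SBATCH --job-name=elastic-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
# Print the hostname of each cloud node that was launched for the job
srun hostname

Submitting it and watching the cluster grow might look like:

sinfo                 # initially reports no available nodes
sbatch test.sbatch    # triggers the ResumeProgram, which launches 2 EC2 instances
squeue                # the job pends until the new nodes register with slurmctld
sinfo                 # the 2 cloud nodes now appear and accept work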
NOTE: The cluster only allows the ephemeral nodes to be stood up in a single AZ. For additional AZs, follow the example in /nfs/slurm/etc/slurm.conf.d/slurm_nodes.conf:
NodeName=ip-10-0-1-[6-250] CPUs=8 Feature=us-west-2a State=Cloud
NodeName=ip-10-0-2-[6-250] CPUs=8 Feature=us-west-2b State=Cloud
NodeName=ip-10-0-3-[6-250] CPUs=8 Feature=us-west-2c State=Cloud
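Because each availability zone is exposed as a node Feature, a job can be pinned to a specific AZ with a constraint once the extra node definitions are in place, for example:

# Request nodes tagged with the us-west-2b feature (second AZ above)
sbatch --constraint=us-west-2b test.sbatch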
NOTE: With minor modifications to slurm-aws-startup.sh and slurm-aws-shutdown.sh (adding local AWS credentials), you can burst from an on-prem SLURM headnode that is managing an on-prem compute cluster. You need to ensure that you can resolve AWS private addresses, either through AWS Direct Connect and/or a VPN layer. A sketch of such a modification follows.
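One way to do this is to point the AWS CLI calls in both scripts at a dedicated credentials profile near the top of each script. This is only a sketch: the file path and profile name are placeholders, and long-lived keys should be handled according to your own security practices.

# Added near the top of slurm-aws-startup.sh and slurm-aws-shutdown.sh so that
# the AWS CLI calls authenticate correctly from the on-prem headnode
export AWS_SHARED_CREDENTIALS_FILE=/nfs/slurm/etc/aws_credentials   # placeholder path
export AWS_PROFILE=slurm-burst                                      # placeholder profile name
export AWS_DEFAULT_REGION=us-west-2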
NOTE: This demo uses the c4.2xlarge instance type for the compute nodes, which is why the number of CPUs is statically set to CPUs=8 in slurm_nodes.conf. If you want to experiment with different instance types (set in slurm-aws-startup.sh), ensure you change the CPUs value in slurm_nodes.conf to match, as in the example below.
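For example, moving to c4.4xlarge (16 vCPUs) would mean changing the instance type launched by slurm-aws-startup.sh and updating the matching node definitions, roughly:

# slurm_nodes.conf: CPUs must match the new instance type (c4.4xlarge has 16 vCPUs)
NodeName=ip-10-0-1-[6-250] CPUs=16 Feature=us-west-2a State=Cloud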