This project makes AWS ParallelCluster with the Slurm scheduler aware of the network topology of its Amazon EC2 instances. It uses the Slurm topology plugin to place jobs on nodes that are in close network proximity.
In November 2023, AWS announced the Instance Topology API. It gives customers a unique, per-account hierarchical view of the relative proximity between Amazon EC2 instances. To learn more, please visit the EC2 User Guide.
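To make "relative proximity" concrete, here is a minimal sketch of how the hierarchy can be read. It assumes records shaped like the `Instances` list returned by DescribeInstanceTopology, where each instance carries a `NetworkNodes` list ordered from the top layer down to the layer closest to the instance. The instance and network node IDs are made up for illustration, and `shared_levels` is a hypothetical helper, not part of this repository.

```python
# Illustrative records shaped like the "Instances" list returned by
# DescribeInstanceTopology. All IDs below are made up for this sketch.
instances = [
    {"InstanceId": "i-aaaa", "NetworkNodes": ["nn-top1", "nn-mid1", "nn-leaf1"]},
    {"InstanceId": "i-bbbb", "NetworkNodes": ["nn-top1", "nn-mid1", "nn-leaf1"]},
    {"InstanceId": "i-cccc", "NetworkNodes": ["nn-top1", "nn-mid1", "nn-leaf2"]},
    {"InstanceId": "i-dddd", "NetworkNodes": ["nn-top1", "nn-mid2", "nn-leaf3"]},
]


def shared_levels(a, b):
    """Count the network nodes two instances share, starting from the top layer."""
    count = 0
    for node_a, node_b in zip(a["NetworkNodes"], b["NetworkNodes"]):
        if node_a != node_b:
            break
        count += 1
    return count


# More shared layers means the two instances are closer in the network.
for other in instances[1:]:
    print(other["InstanceId"], shared_levels(instances[0], other))
```

In this sample, `i-aaaa` and `i-bbbb` share all three layers, so they are the best candidates to run the same tightly coupled job.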
This solution walks you through the steps to:
- Create a topology configuration for Slurm based on the EC2 instance topology information.
- Enable the topology plugin in Slurm.
- Update Slurm configuration to use topology-aware scheduling.
NOTE: We recommend this solution for static compute clusters.
Before starting, make sure you have the following permission on the AWS ParallelCluster HeadNode:
ec2:DescribeInstanceTopology
You can grant it by attaching the arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess managed policy to the HeadNode AdditionalIamPolicies config.
Start by creating the topology.conf file, which describes the network topology of the Amazon EC2 instances in your cluster.
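As a rough illustration of what such a file can contain, the sketch below turns topology records into `SwitchName=... Switches=...` and `SwitchName=... Nodes=...` lines, the format used by Slurm's tree topology. It is not the logic of the script in this repository. The sample records, the `node_names` mapping from instance ID to Slurm node name, and the function name `build_topology_lines` are all assumptions made for the example.

```python
from collections import defaultdict

# Made-up sample data: topology records plus a hypothetical mapping
# from EC2 instance ID to Slurm node name.
instances = [
    {"InstanceId": "i-aaaa", "NetworkNodes": ["nn-top1", "nn-mid1", "nn-leaf1"]},
    {"InstanceId": "i-bbbb", "NetworkNodes": ["nn-top1", "nn-mid1", "nn-leaf1"]},
    {"InstanceId": "i-cccc", "NetworkNodes": ["nn-top1", "nn-mid1", "nn-leaf2"]},
    {"InstanceId": "i-dddd", "NetworkNodes": ["nn-top1", "nn-mid2", "nn-leaf3"]},
]
node_names = {"i-aaaa": "node-1", "i-bbbb": "node-2", "i-cccc": "node-3", "i-dddd": "node-4"}


def build_topology_lines(instances, node_names):
    """Return topology.conf-style lines, treating each network node as a switch."""
    children = defaultdict(set)  # network node -> network nodes directly below it
    members = defaultdict(set)   # bottom network node -> Slurm node names
    for inst in instances:
        layers = inst["NetworkNodes"]  # assumed ordered top layer first
        for parent, child in zip(layers, layers[1:]):
            children[parent].add(child)
        members[layers[-1]].add(node_names[inst["InstanceId"]])

    lines = []
    for switch in sorted(children):
        lines.append(f"SwitchName={switch} Switches={','.join(sorted(children[switch]))}")
    for switch in sorted(members):
        lines.append(f"SwitchName={switch} Nodes={','.join(sorted(members[switch]))}")
    return lines


print("\n".join(build_topology_lines(instances, node_names)))
```

Upper-layer network nodes become switches that list other switches, and the bottom layer becomes the switches that list compute nodes.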
Connect to the HeadNode of your AWS ParallelCluster cluster and download the content of this repository:
git clone https://github.com/aws-samples/ec2-topology-aware-for-slurm.git
cd ec2-topology-aware-for-slurm
The ec2-topology.py script takes two arguments: the cluster name and the Amazon EC2 instance type associated with the Slurm partition.
For this step and the following ones, you need to become the root user to create the topology.conf configuration file and restart the Slurm services.
sudo -s
Set the AWS Region from the instance metadata, then create a Python virtual environment and install the dependencies.
export AWS_DEFAULT_REGION=$(TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") \
&& curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region)
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
Run the ec2-topology.py Python script, which creates the topology.conf file in /opt/slurm/etc/.
python3 ec2-topology.py --cluster_name [CLUSTER_NAME] --instance_type [INSTANCE_TYPE]
Exit the Python Virtual Environment.
deactivate
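Before pointing Slurm at the generated file, it can be worth a quick sanity check. The following is a minimal sketch of such a check, not a tool shipped with this repository: `check_topology` is a hypothetical helper that looks for duplicate switch definitions and for switches that are referenced but never defined. It assumes plain comma-separated names and does not expand Slurm hostlist expressions such as `s[0-2]`.

```python
def check_topology(text):
    """Return a list of problems found in topology.conf-style text."""
    defined, referenced = set(), set()
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
        name = fields.get("SwitchName")
        if name is None:
            problems.append(f"line without SwitchName: {raw!r}")
            continue
        if name in defined:
            problems.append(f"switch defined twice: {name}")
        defined.add(name)
        if "Switches" in fields:
            # Assumption: plain comma-separated names, no bracket ranges.
            referenced.update(fields["Switches"].split(","))
        elif "Nodes" not in fields:
            problems.append(f"switch has neither Nodes nor Switches: {name}")
    for missing in sorted(referenced - defined):
        problems.append(f"switch referenced but not defined: {missing}")
    return problems


sample = "SwitchName=root Switches=s0,s9\nSwitchName=s0 Nodes=node-1,node-2\n"
print(check_topology(sample))
```

To check the real file, read /opt/slurm/etc/topology.conf and pass its content to the function; an empty list means none of these particular problems were found.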
Edit the Slurm configuration file, slurm.conf, to set up topology-aware scheduling.
cat >> /opt/slurm/etc/slurm.conf << EOF
TopologyPlugin=topology/tree
TopologyParam=RouteTree
EOF
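Note that the append above adds the two lines every time it is run. If you expect to repeat this step, one option is a small helper that only appends a setting when its key is not already present. The sketch below is an illustration under that assumption; `ensure_settings` is a hypothetical name, and the demo runs against a temporary file rather than the real slurm.conf.

```python
import os
import tempfile
from pathlib import Path

# The two settings from the step above.
SETTINGS = {
    "TopologyPlugin": "topology/tree",
    "TopologyParam": "RouteTree",
}


def ensure_settings(conf_path, settings=SETTINGS):
    """Append each key=value unless the key is already set; return the keys added."""
    path = Path(conf_path)
    text = path.read_text()
    existing = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if "=" in line:
            # Compare keys case-insensitively.
            existing.add(line.split("=", 1)[0].strip().lower())
    added = [key for key in settings if key.lower() not in existing]
    if added:
        with path.open("a") as handle:
            if text and not text.endswith("\n"):
                handle.write("\n")  # avoid fusing with an unterminated last line
            for key in added:
                handle.write(f"{key}={settings[key]}\n")
    return added


# Demo on a throwaway file; running it twice adds the lines only once.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "slurm.conf")
    Path(demo).write_text("ClusterName=demo\n")
    print(ensure_settings(demo))
    print(ensure_settings(demo))
```

On the HeadNode, the path to pass would be /opt/slurm/etc/slurm.conf, and the call needs root privileges like the rest of this step.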
After editing, ask the compute nodes to re-read the slurm.conf file, then restart the Slurm controller.
scontrol reconfigure
systemctl restart slurmctld
Your Slurm jobs will now be scheduled based on the Amazon EC2 instance topology.
Maxime Hugues (AWS)
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.