This ARM template is inspired by Christian Smith's templates:
- BeeGFS template: https://github.com/smith1511/hpc/tree/master/beegfs-shared-on-centos7.2
- Slurm template: https://github.com/smith1511/hpc/tree/master/slurm-on-centos7.1-hpc
It merges both templates.
It deploys, on the same set of VMs:
- a BeeGFS cluster with metadata and storage nodes
- Slurm as the job scheduler
- Fill in the mandatory parameters.
- Select an existing resource group or enter the name of a new resource group to create.
- Select the resource group location.
- Accept the terms and agreements.
- Click Create.
The VM called storage0 is:
- the BeeGFS metadata server and management host
- the Slurm master
- the NFS server, exporting the shared directories /share/home and /share/data
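The corresponding entries in /etc/exports on storage0 might look like the following (a sketch only; the actual export options are set by the template's deployment script):

```
/share/home *(rw,no_root_squash)
/share/data *(rw,no_root_squash)
```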
The VMs called storage[1-n] are:
- BeeGFS storage servers
- [Optional] some of them may also be BeeGFS metadata servers (depending on the template parameters)
- Slurm compute nodes
The BeeGFS storage is mounted on /share/scratch on every node.
By default, each compute node has 1 core available to Slurm.
You should edit the slurm.conf file to reflect the real number of CPUs:
NodeName=storage[1-number_of_nodes] Procs=16
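One way to do this is to substitute the output of nproc into the Procs= value. A minimal sketch, shown here on a local copy of the file (the node range [1-4] is an assumption; adapt the path, typically /etc/slurm/slurm.conf, and the range to your deployment):

```shell
# Create a local copy standing in for /etc/slurm/slurm.conf
# (the [1-4] range is a placeholder for your real node count)
cat > slurm.conf <<'EOF'
NodeName=storage[1-4] Procs=1
EOF
# Replace the Procs= value with the core count reported by the OS
sed -i "s/Procs=[0-9][0-9]*/Procs=$(nproc)/" slurm.conf
cat slurm.conf
```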
Then restart the slurm daemon:
systemctl restart slurmctld
And bring the nodes online with scontrol:
scontrol: update NodeName=storage1 State=RESUME
scontrol: update NodeName=storage2 State=RESUME
scontrol: exit
Then check the result with:
sinfo -N -l
Simply SSH to the master node using its public IP address:
# ssh [user]@[public_ip_address]
You can log into the master node using the admin user and password specified at deployment.
TODO:
- check that all packages installed by the install_pkgs_slurm function in deployazure.sh are mandatory
- let the user choose how many data disks per VM
- use VMSS instead of VMs
- use Ganglia for monitoring
- enable MPI if RDMA instances are used, and use the CentOS HPC images