Work-in-progress for a terraform-defined Slurm cluster with:
- VM login nodes
- Baremetal or VM compute nodes
- Slurm-driven reimaging
- Resizing of compute nodes
- Flexible, ansible-based deployment of addtional cluster functionality e.g. filesystems, monitoring etc.
Not all the above currently works.
THIS NEEDS UPDATING
It is recommended to install python dependencies in a virtual environment:
virtualenv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
If SELinux is in use on the ansible control host, enable access to the
selinux
python module from the virtualenv:
ln -s /usr/lib64/python2.7/site-packages/selinux venv/lib/python2.7/site-packages/selinux
Download and deploy the role from Ansible Galaxy:
ansible-galaxy install -r ansible/requirements.yml -p $PWD/ansible/roles
Deactivate the virtual environment:
deactivate
Prior to using stackhpc-appliances, ensure the virtual environment is activated:
source venv/bin/activate
Both terraform and ansible are configured configured locally through YAML files.
As part of development these are currently split between:
config/deploy.yml
- only those variables needed to create the cluster nodesconfig/openhpc.yml
- only those variables needed to configure the cluster (i.e. there are no double-defintions)
-
Modify these as required.
-
Create the head node using:
cd infra terraform apply # uses config/deploy.yml only
-
Create the compute nodes using:
cd ../computes terraform apply # uses config/deploy.yml only
These will generate inventory files in the inventory/
directory.
-
Deploy and configure software using:
cd <repo root> ansible-playbook -i inventory/ --vault-password-file ~/alaska-vault-password -e @config/openhpc.yml -e @config/deploy.yml ansible/cluster.yml
Aside from functionality specifically listed above:
- Only manages OpenHPC cluster.
- Uses terraform instead of Heat + cluster-infra role.
- Uses
actions
branch of stackhpc.openhpc role. - Does not generate
/etc/hosts
- not needed now DNS fixed.
- FIXME: Remove double config/extravars in ansible
- FIXME: monitoring is disabled
- FIXME: sdd removed from raid array (because its broken on one node).