This repository contains playbooks and documentation to deploy virtual Linux HPC clusters, which can be used as collaborative, analytical sandboxes. All clusters are named after robots that appear in the animated sitcom Futurama.
The main ingredients for (deploying) these clusters:
- Ansible playbooks for system configuration management.
- OpenStack for virtualization. (Note that deploying OpenStack itself is not part of the configs/code in this repo.)
- Spacewalk to create freezes of Linux distros.
- CentOS 7 as OS for the virtual machines.
- Slurm as workload/resource manager to orchestrate jobs.
The master and develop branches of this repo are protected; updates can only be merged into these branches using reviewed pull requests.
Once in a while we create releases, which are versioned using the format YY.MM.v where:
- YY is the year of release
- MM is the month of release
- v is the sequence number of the release within that month and year. Hence it is not the day of the month.

E.g. 19.01.1 is the first release in January 2019.
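As an illustration of the version format, the following shell sketch validates a release tag and splits it into its parts. The function name `parse_release` is hypothetical and not part of this repo, and the sketch assumes that the two-digit year refers to the 2000s.

```shell
# Hypothetical helper (not part of this repo): split a YY.MM.v release tag.
parse_release() {
    tag="$1"
    # Two-digit year, month 01-12, numeric release counter.
    if ! printf '%s\n' "$tag" | grep -Eq '^[0-9]{2}\.(0[1-9]|1[0-2])\.[0-9]+$'; then
        echo "invalid release tag: $tag" >&2
        return 1
    fi
    year="20${tag%%.*}"     # assumption: all releases are from the 2000s
    rest="${tag#*.}"        # strip "YY."
    month="${rest%%.*}"
    release="${rest#*.}"
    printf 'year=%s month=%s release=%s\n' "$year" "$month" "$release"
}

parse_release 19.01.1   # prints: year=2019 month=01 release=1
```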
We follow the Python PEP8 naming conventions for variable names, function names, etc.
This repo currently contains code and configs for the following clusters:
- Gearshift: UMCG Research IT cluster hosted by the Center for Information Technology (CIT) at the University of Groningen.
- Talos: Development cluster hosted by the Center for Information Technology (CIT) at the University of Groningen.
- Hyperchicken: Solve-RD cluster hosted by The European Bioinformatics Institute (EMBL-EBI) in the Embassy Cloud.
Deployment and functional administration of all clusters is a joint effort of the Genomics Coordination Center (GCC) and the Center for Information Technology (CIT) from the University Medical Center and University of Groningen.
The clusters are composed of the following types of machines:
- Jumphost: security-hardened machines for SSH access.
- User Interface (UI): machines for job management by regular users.
- Deploy Admin Interface (DAI): machines for deployment of bioinformatics software and reference datasets without root access.
- Sys Admin Interface (SAI): machines for maintenance / management tasks that require root access.
- Compute Node (CN): machines that crunch jobs submitted by users on a UI.
The clusters use the following types of storage systems / folders:
Filesystem/Folder | Shared/Local | Backups | Mounted on | Purpose/Features |
---|---|---|---|---|
/home/${home}/ | Shared | Yes | UIs, DAIs, SAIs, CNs | Only for personal preferences: small data == tiny quota. |
/groups/${group}/prm[0-9]/ | Shared | Yes | UIs, DAIs | permanent storage folders: for rawdata or final results that need to be stored for the mid/long term. |
/groups/${group}/tmp[0-9]/ | Shared | No | UIs, DAIs, CNs | temporary storage folders: for staged rawdata and intermediate results on compute nodes that only need to be stored for the short term. |
/groups/${group}/scr[0-9]/ | Local | No | Some UIs | scratch storage folders: same as tmp, but local storage as opposed to shared storage. Optional and hence not available on all UIs. |
/local/${slurm_job_id} | Local | No | CNs | Local storage on compute nodes only available during job execution. Hence folders are automatically created when a job starts and deleted when it finishes. |
/mnt/${complete_filesystem} | Shared | Mixed | SAIs | Complete file systems, which may contain various home, prm, tmp or scr dirs. |
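To illustrate how the tmp and /local storage types from the table could work together, here is a minimal sketch of a Slurm job script that stages data to the job-specific local folder, processes it there and copies the result back to shared tmp storage. The group name `mygroup`, the folder `tmp01`, the file names and the `sort` placeholder workload are all made up for this example; `SLURM_JOB_ID` is the environment variable that Slurm sets inside a running job.

```shell
#!/bin/bash
#SBATCH --job-name=staging-sketch
#SBATCH --time=00:10:00

# Hypothetical locations: group "mygroup", folder "tmp01" and the file names are made up.
TMP_DIR="${TMP_DIR:-/groups/mygroup/tmp01/project}"
# The /local/<job id> folder only exists while the job is running.
LOCAL_DIR="${LOCAL_DIR:-/local/${SLURM_JOB_ID:-}}"

run_job() {
    # Stage the input from shared tmp storage to fast local storage.
    cp "${TMP_DIR}/input.txt" "${LOCAL_DIR}/input.txt" || return 1
    # Placeholder for the real analysis.
    sort "${LOCAL_DIR}/input.txt" > "${LOCAL_DIR}/output.txt" || return 1
    # Copy the result back before the local folder is deleted at job end.
    cp "${LOCAL_DIR}/output.txt" "${TMP_DIR}/output.txt"
}

# Only run the workload when executed inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_job
fi
```

Final results that need to be kept would then be moved from tmp to a prm folder on a UI, because prm folders are not mounted on the compute nodes.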
Deploying a fully functional virtual cluster involves the following steps:
- Configure physical machines
- Deploy OpenStack virtualization layer on physical machines to create an OpenStack cluster
- Create and configure virtual machines on the OpenStack cluster to create an HPC cluster on top of an OpenStack cluster
- Deploy bioinformatics software and reference datasets
The Ansible playbooks in this repository use roles from the hpc-cloud repository. The roles are imported here explicitly by Ansible using Ansible Galaxy. These roles install various Docker images built and hosted by RuG webhosting. They are built from separate git repositories on https://git.webhosting.rug.nl.
The steps below describe how to get from machines with a bare Ubuntu 16.04 installed to a running OpenStack installation.
- Clone this repo.

      mkdir -p ${HOME}/git/
      cd ${HOME}/git/
      git clone https://github.com/rug-cit-hpc/league-of-robots.git
- First import the required roles into this playbook:

      ansible-galaxy install -r requirements.yml --force -p roles
      ansible-galaxy install -r galaxy-requirements.yml
- Create .vault_pass.txt.
  - To generate a new Ansible vault password and put it in .vault_pass.txt, use the following one-liner:

        tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1 > .vault_pass.txt

  - Or, to use an existing Ansible vault password, create .vault_pass.txt and use a text editor to add the password.
  - Make sure .vault_pass.txt is private:

        chmod go-rwx .vault_pass.txt
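The password step can also be wrapped in a small function that never leaves the file readable by others, not even briefly. This is only a sketch: the function name `create_vault_pass` is hypothetical, and it is a variant of the one-liner that reads a fixed block of random bytes, so the pipeline ends on its own instead of relying on `head` closing the pipe.

```shell
# Hypothetical helper (not part of this repo): create a private vault password file.
create_vault_pass() {
    file="${1:-.vault_pass.txt}"
    # Never overwrite an existing password: encrypted secrets would become unreadable.
    if [ -e "$file" ]; then
        echo "refusing to overwrite existing $file" >&2
        return 1
    fi
    (
        # The restrictive umask makes the file private from the moment it is created.
        umask 077
        # 4096 random bytes yield far more than 30 alphanumeric characters.
        head -c 4096 /dev/urandom | LC_ALL=C tr -cd '[:alnum:]' | cut -c1-30 > "$file"
    )
}

# Usage: create_vault_pass .vault_pass.txt
```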
- Configure Ansible settings including the vault.
  - To create (a new) secrets.yml: generate and encrypt the passwords for the various OpenStack components.

        ./generate_secrets.py
        ansible-vault --vault-password-file=.vault_pass.txt encrypt secrets.yml

    The encrypted secrets.yml can now safely be committed. The .vault_pass.txt file is in the .gitignore and needs to be transferred in a secure way.
  - To use an existing encrypted secrets.yml, add .vault_pass.txt to the root folder of this repo and create ansible.cfg in the same location using the following template:

        [defaults]
        inventory = hosts
        stdout_callback = debug
        forks = 20
        vault_password_file = .vault_pass.txt
        remote_user = your_local_account_not_from_the_LDAP
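Before committing secrets.yml it can be worth checking that the file really is encrypted. Files encrypted with ansible-vault start with a header line such as `$ANSIBLE_VAULT;1.1;AES256`, so a minimal check only needs to look at the first line. The function name `is_vault_encrypted` is hypothetical and not part of this repo.

```shell
# Hypothetical helper (not part of this repo): succeed only if the file
# starts with the ansible-vault header.
is_vault_encrypted() {
    first_line="$(head -n1 "$1" 2>/dev/null)" || return 1
    case "$first_line" in
        '$ANSIBLE_VAULT;'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   is_vault_encrypted secrets.yml && git add secrets.yml
```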
- Build Prometheus Node Exporter
  - Make sure you are a member of the docker group. Otherwise you will get this error:

        Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: permission denied
        context canceled

  - Execute:

        cd promtools
        ./build.sh
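Group membership can be checked up front instead of waiting for the build to fail. The sketch below uses `id -nG`, which lists the names of all groups of the current user; the function name `in_group` is hypothetical.

```shell
# Hypothetical helper (not part of this repo): succeed if the current user
# is a member of the group given as first argument.
in_group() {
    case " $(id -nG) " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_group docker; then
    echo "OK: $(id -un) is a member of the docker group"
else
    echo "WARNING: $(id -un) is not a member of the docker group" >&2
fi
```

Note that a newly added group membership normally only becomes active in a new login session.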
- Running playbooks. Some examples:
  - Install the OpenStack cluster.

        ansible-playbook site.yml

  - Deploy only the Slurm part on the test cluster Talos.

        ansible-playbook -i talos_hosts slurm.yml
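A playbook run can be preceded by a syntax check, so that typos are caught before any machine is touched. The wrapper below is a sketch under assumptions: the function name `deploy` and the `ANSIBLE_PLAYBOOK` override variable are hypothetical, while `--syntax-check` and `-i` are regular `ansible-playbook` options.

```shell
# Hypothetical wrapper (not part of this repo): syntax-check a playbook,
# then run it against the given inventory.
deploy() {
    inventory="$1"
    playbook="$2"
    # ANSIBLE_PLAYBOOK is a made-up override that allows substituting the command.
    cmd="${ANSIBLE_PLAYBOOK:-ansible-playbook}"
    "$cmd" -i "$inventory" --syntax-check "$playbook" || return 1
    "$cmd" -i "$inventory" "$playbook"
}

# Usage: deploy talos_hosts slurm.yml
```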
- Verify operation.