
League of Robots

About this repo

This repository contains playbooks and documentation to deploy virtual Linux HPC clusters, which can be used as collaborative, analytical sandboxes. All clusters were named after robots that appear in the animated sitcom Futurama.

Software/framework ingredients

The main ingredients for (deploying) these clusters:

  • Ansible playbooks for system configuration management.
  • OpenStack for virtualization. (Note that deploying OpenStack itself is not part of the configs/code in this repo.)
  • Spacewalk to create freezes of Linux distros.
  • CentOS 7 as OS for the virtual machines.
  • Slurm as workload/resource manager to orchestrate jobs.

Branches and Releases

The master and develop branches of this repo are protected; updates can only be merged into these branches using reviewed pull requests. Once in a while we create releases, which are versioned using the format YY.MM.v where:

  • YY is the year of release
  • MM is the month of release
  • v is an incrementing release number within that month and year: 1 for the first release, 2 for the second, and so on. It is not the day of the month.

E.g. 19.01.1 is the first release in January 2019.
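
For example, creating such a release with a plain annotated git tag could look like this (a sketch; the actual release procedure for this repo may involve additional steps):

    git tag -a 19.01.1 -m 'First release of January 2019.'
    git push origin 19.01.1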

Code style and naming conventions

We follow the Python PEP8 naming conventions for variable names, function names, etc.

Clusters

This repo currently contains code and configs for the following clusters:

Deployment and functional administration of all clusters is a joint effort of the Genomics Coordination Center (GCC) and the Center for Information Technology (CIT) from the University Medical Center and the University of Groningen.

Cluster components

The clusters are composed of the following types of machines (a connection example follows this list):

  • Jumphost: security-hardened machines for SSH access.
  • User Interface (UI): machines for job management by regular users.
  • Deploy Admin Interface (DAI): machines for deployment of bioinformatics software and reference datasets without root access.
  • Sys Admin Interface (SAI): machines for maintenance / management tasks that require root access.
  • Compute Node (CN): machines that crunch jobs submitted by users on a UI.
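
For example, a regular user typically hops via a jumphost to a UI and submits work to the compute nodes from there. A minimal sketch, assuming placeholder account and host names (your_account, jumphost, ui) and a job script of your own, none of which are defined by this repo:

    # Log in on a UI via the jumphost (OpenSSH ProxyJump; account and host names are placeholders).
    ssh -J your_account@jumphost your_account@ui
    # On the UI, submit a job script to Slurm; it will be executed on a compute node (CN).
    sbatch my_job_script.sh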

The clusters use the following types of storage systems / folders (a data flow example follows this overview):

  • /home/${home}/
    Shared, with backups; mounted on UIs, DAIs, SAIs and CNs.
    Only for personal preferences: small data == tiny quota.
  • /groups/${group}/prm[0-9]/
    Shared, with backups; mounted on UIs and DAIs.
    Permanent storage folders: for raw data or final results that need to be stored for the mid/long term.
  • /groups/${group}/tmp[0-9]/
    Shared, no backups; mounted on UIs, DAIs and CNs.
    Temporary storage folders: for staged raw data and intermediate results on compute nodes that only need to be stored for the short term.
  • /groups/${group}/scr[0-9]/
    Local, no backups; mounted on some UIs.
    Scratch storage folders: same as tmp, but local as opposed to shared storage. Optional and hence not available on all UIs.
  • /local/${slurm_job_id}
    Local, no backups; mounted on CNs.
    Local storage on compute nodes, only available during job execution: the folder is automatically created when a job starts and deleted when it finishes.
  • /mnt/${complete_filesystem}
    Shared, backups vary per file system; mounted on SAIs.
    Complete file systems, which may contain various home, prm, tmp or scr dirs.
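
A typical data flow, sketched with a placeholder group name and file names that are not defined by this repo: stage raw data from a prm folder to a tmp folder, let jobs on the compute nodes read from and write to tmp, and archive only the final results back to prm.

    # Placeholders: replace "mygroup" and the file names with real ones.
    cp /groups/mygroup/prm0/rawdata/sample_01.fq.gz  /groups/mygroup/tmp0/rawdata/
    # ... submit jobs that process the staged data on the tmp filesystem ...
    cp /groups/mygroup/tmp0/results/sample_01.tar.gz /groups/mygroup/prm0/results/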

Deployment phases

Deploying a fully functional virtual cluster involves the following steps:

  1. Configure physical machines
  2. Deploy OpenStack virtualization layer on physical machines to create an OpenStack cluster
  3. Create and configure virtual machines on the OpenStack cluster to create an HPC cluster on top of an OpenStack cluster
  4. Deploy bioinformatics software and reference datasets

2. Ansible playbooks for the OpenStack cluster

The Ansible playbooks in this repository use roles from the hpc-cloud repository. The roles are imported here explicitly using Ansible Galaxy. These roles install various Docker images built and hosted by RuG webhosting and built from separate git repositories on https://git.webhosting.rug.nl.

Deployment of OpenStack

The steps below describe how to get from machines with a bare Ubuntu 16.04 installation to a running OpenStack installation.


  1. Clone this repo.

    mkdir -p ${HOME}/git/
    cd ${HOME}/git/
    git clone https://github.com/rug-cit-hpc/league-of-robots.git
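    The remaining steps assume you work from the root of the cloned repo, e.g.:

    cd "${HOME}/git/league-of-robots/"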
  2. Import the required roles for the playbooks:

    ansible-galaxy install -r requirements.yml --force -p roles
    ansible-galaxy install -r galaxy-requirements.yml
  3. Create .vault_pass.txt.

    • To generate a new Ansible vault password and put it in .vault_pass.txt, use the following oneliner:
    tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1 > .vault_pass.txt
    • Or, to use an existing Ansible vault password, create .vault_pass.txt and use a text editor to add the password. Make sure the .vault_pass.txt is private:
    chmod go-rwx .vault_pass.txt
  4. Configure Ansible settings including the vault.

    • To create a new secrets.yml: generate and encrypt the passwords for the various OpenStack components.

      ./generate_secrets.py
      ansible-vault --vault-password-file=.vault_pass.txt encrypt secrets.yml

      The encrypted secrets.yml can now safely be committed. The .vault_pass.txt file is in the .gitignore and needs to be transferred in a secure way.

    • To use an existing encrypted secrets.yml, add .vault_pass.txt to the root folder of this repo and create an ansible.cfg in the same location using the following template:

      [defaults]
      inventory = hosts
      stdout_callback = debug
      forks = 20
      vault_password_file = .vault_pass.txt
      remote_user = your_local_account_not_from_the_LDAP
      
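    • Once encrypted, secrets.yml can be inspected or updated in place with the standard ansible-vault subcommands, using the same option style as above; for example:

      ansible-vault --vault-password-file=.vault_pass.txt view secrets.yml
      ansible-vault --vault-password-file=.vault_pass.txt edit secrets.yml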
  5. Build Prometheus Node Exporter

    • Make sure you are a member of the docker group; otherwise you will get an error like the one below (a possible fix is sketched at the end of this step):
         Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect:
         permission denied
         context canceled
      
    • Execute:
      cd promtools
      ./build.sh
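    • If you hit the permission error above, a common fix is to add your account to the docker group and start a new login session. This assumes you have sudo rights and that the group uses the default name docker:
      sudo usermod -aG docker "${USER}"
      # Log out and back in, or run 'newgrp docker', for the new group membership to take effect.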
  6. Running playbooks. Some examples:

    • Install the OpenStack cluster.
      ansible-playbook site.yml
    • Deploy only the Slurm part on the test cluster Talos.
      ansible-playbook -i talos_hosts slurm.yml
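    • Do a dry run first to preview the changes a playbook would make; --check and --diff are standard ansible-playbook options.
      ansible-playbook site.yml --check --diff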
  7. Verify operation.
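
    A minimal smoke test, assuming the Slurm part was deployed and you are logged in on a UI; sinfo and srun are standard Slurm commands, not scripts from this repo:

      sinfo                               # list partitions and the state of the compute nodes
      srun --nodes=1 --ntasks=1 hostname  # run a trivial test job on a compute node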

Steps to upgrade the OpenStack cluster

3. Steps to install the compute cluster on top of the OpenStack cluster