
illume-v2

Rebuilding Illume cluster using VM workflow. Created to be as generic as possible to allow for use elsewhere.


Table of Contents

  • Overview
  • Prerequisites
  • Build VM Images
  • Deploying to OpenStack
  • Monitoring
  • How-to Guides
  • Authors and acknowledgements

Overview

Illume is infrastructure-as-code for deploying an HPC cluster on OpenStack. It consists of Packer image definitions and Terraform deployment profiles.

Illume is designed for use with NFS for storage, but it shouldn't be too difficult to support other types.

This is achieved with a two-stage process - using Packer to build VM images with all appropriate software, and then deploying them via Terraform. Both of these are easily configurable to suit your needs; within the /packer directory you will find /bootstrap, which contains groups of scripts and configuration files used to install certain tools, and /vm-profiles, which contains the image definitions composed of these bootstrap scripts.

In the /terraform directory, you will find a collection of host-profiles, which describe the instances we want to create on the hardware. These can easily be scaled and customized to fit your hardware, and even modified (with a bit of work) to target other infrastructure providers like AWS, since Terraform offers providers for many of them.
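
To give a feel for what a host-profile boils down to, here is a hedged sketch of the kind of resource each one ultimately declares (the names, flavor, key pair, and network below are illustrative; the repo's actual profiles are more involved):

# Illustrative sketch only -- the real host-profiles in /terraform differ.
resource "openstack_compute_instance_v2" "worker" {
  count           = 4                          # scale by changing the count
  name            = "illume-worker-${count.index}"
  image_name      = "illume-worker-nogpu"      # an image built by Packer
  flavor_name     = "m1.large"                 # hypothetical flavor
  key_pair        = "illume-provisioning-key"  # hypothetical key pair name
  security_groups = ["default"]

  network {
    name = "illume-internal"                   # hypothetical network name
  }
}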

Prerequisites

  • Packer 1.7.2+
  • Terraform 1.0.0+
  • OpenStack RC File (can be retrieved by logging into OpenStack -> click username in the top right -> Download OpenStack RC File V3)
  • An SSH key pair for provisioning
  • (Optional) OpenStack Client - This is helpful for retrieving information from OpenStack like flavors etc.

Fill in /setup-env.sh with your SSH key location and the path to the OpenStack RC file, and then run it. You also need to create a file called /terraform/variables.tfvars with assignments for the secret variables in /terraform/variables.tf. Alternatively, you can leave the fields in /terraform/variables.tf as

{
    default = ""
}

if you want to be prompted for them each time you run a Terraform command. DO NOT FILL IN /terraform/variables.tf; instead, fill in /terraform/variables.tfvars, which keeps your credentials separate from the variable template. DO NOT COMMIT WITH YOUR INFORMATION FILLED IN.
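
As a rough illustration, a variables.tfvars file is just a set of assignments, one per secret variable declared in /terraform/variables.tf (the variable names below are hypothetical; match them to the actual declarations):

# terraform/variables.tfvars -- keep this file out of version control.
# Variable names are examples only; use the ones declared in variables.tf.
openstack_password = "your-openstack-password"
ssh_key_file       = "/path/to/key"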

NOTE - Certain information is not included in the repository, as it is hosted on an NFS share that is mounted at provisioning time. This includes:

  • LDAP configuration and database
  • User home directories (that correspond to the LDAP accounts)
  • Grafana and Prometheus dashboards and configuration

Build VM Images

The VM images are located under /packer/vm-profiles. The images build on one another to keep build times down the higher up the stack you go, while also keeping the profiles themselves concise and free of repetition. The hierarchy is as follows:

 non-interactive ->|-> openLDAP
                   |-> proxy
                   |-> monitor
                   |-> control
                   |-> phpLDAPadmin
                   |-> interactive ->|-> bastion
                                     |-> ingress
                                     |-> worker-nogpu ->|-> worker-gpu               

This organization also makes it easy to change several images by modifying only one profile. For example, if you wanted to add numpy, you could add it to interactive and then rebuild the images that depend on it, giving them all numpy.

NOTE - When rebuilding the worker-gpu image, at least 1 GPU must be unassigned in OpenStack. This is because Packer will spin up an instance with a GPU to build the image, since it needs one in order for CUDA and other GPU packages to install and be tested correctly.
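
To make the chaining concrete, here is a hedged sketch of what a profile under /packer/vm-profiles might look like for the GPU worker (the names, paths, and fields are illustrative; the real profiles in the repo will differ):

# Illustrative sketch only -- the real profiles under /packer/vm-profiles differ.
source "openstack" "worker-gpu" {
  image_name        = "illume-worker-gpu"    # name of the image this build produces
  source_image_name = "illume-worker-nogpu"  # parent image from the hierarchy above
  flavor            = "gpu-flavor"           # hypothetical flavor; must include a GPU (see note above)
  ssh_username      = "ubuntu"
}

build {
  sources = ["source.openstack.worker-gpu"]

  # Hypothetical bootstrap script path; the repo's scripts live under /packer/bootstrap.
  provisioner "shell" {
    scripts = ["../bootstrap/gpu/cuda.sh"]
  }
}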

Deploying to OpenStack

As with the VM images, the Terraform deployment profiles are set up with dependencies so that post-provisioning can be done once the appropriate instances are deployed. However, Terraform takes care of building them in the correct order so you only need to:

  • Navigate to /terraform
  • Run terraform init
  • Run terraform plan -var-file="variables.tfvars" to review your changes. This fills in the variables from the .tfvars file you created above
  • Once you are happy with the plan, run terraform apply -var-file="variables.tfvars" and confirm when prompted

You can then view the provisioned instances in the OpenStack dashboard under Compute -> Instances.

Monitoring

Illume v2 uses Prometheus to scrape data from nodes, and Grafana to visualize that data. Currently, there are only two exporters in use:

  • Node exporter, which exposes a wide range of hardware, OS, and networking metrics (runs on ALL nodes)
  • Nvidia exporter, which exposes various GPU metrics (runs only on GPU workers)

I have added metadata to the Packer images where appropriate so that Prometheus can tell which metrics to scrape from which nodes; you can add more if you wish to add more exporters or rules.
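
As a hedged illustration of how such metadata might be attached on the Packer side (building on the sketch in the Build VM Images section; the key names here are hypothetical and the real images may label things differently):

# Illustrative sketch only -- metadata keys are hypothetical.
source "openstack" "worker-gpu" {
  image_name        = "illume-worker-gpu"
  source_image_name = "illume-worker-nogpu"
  flavor            = "gpu-flavor"
  ssh_username      = "ubuntu"

  # Image metadata that Prometheus scrape/relabel rules could match on to
  # decide which exporters to scrape for this node type.
  metadata = {
    node_exporter   = "true"
    nvidia_exporter = "true"   # set only on GPU worker images
  }
}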

Grafana cannot be set up automatically, so you must log in and configure the dashboards yourself. The steps are:

  1. Create (if it doesn't already exist) ~/.ssh/config
  2. Create an entry for the bastion host that looks something like this with the public IP and path to key filled in:
Host bastion                                                                                          
     HostName xxx.xxx.xxx.xxx (public IP)                                                                            
     User ubuntu                                                                                         
     IdentityFile /path/to/key
  3. Save the file, then test that it works by running ssh bastion
  4. Now that the bastion connection works, create a second entry in ~/.ssh/config like this:
Host grafana                                                                                     
     User ubuntu                                                                                         
     HostName xxx.xxx.xxx.xxx (fixed IP)                                                                             
     IdentityFile /path/to/key                                                                      
     ProxyJump bastion                                                                                
     LocalForward 3000 localhost:3000

Since Grafana is only hosted internally, we must forward port 3000 and then connect via the bastion as that is the only way into the network from the outside (aside from the ingress). ProxyJump will perform this intermediate connection.

  5. Run ssh grafana. Once you have successfully connected to the Grafana instance, open your web browser and go to
http://localhost:3000

If everything was done correctly, you should land on the Grafana login page.

  6. Enter the following defaults:
Username: admin
Password: admin

You will be prompted to change the password - do so now.

  7. Click the gear icon in the left sidebar, navigate to "Data Sources" and click "Add data source"
  8. The first option should be "Prometheus" - click it
  9. Under the HTTP section, enter http://localhost:9090 for the URL, then scroll to the bottom and click "Save & Test". It should show a green "success" message.
  10. Now that Prometheus is set up, we need to import pre-made dashboards for each of the data types being exported. In the left sidebar, hover over the "+" icon and click "Import"
  11. In the bar that says "Import via grafana.com", enter 1860 and click Load. This should fill in details showing that you are importing a Node Exporter dashboard
  12. In the dropdown at the bottom, select the Prometheus data source we just set up and click Import.
  13. Repeat the import steps with ID 10703, which is the Nvidia exporter dashboard.

Once saved, the dashboards are successfully set up.

How-to Guides

How to perform maintenance

If you want to perform software updates or install new software or tools, do so by modifying the corresponding Packer files. If you instead just need to scale the number of nodes up or down, or change a configuration within the Terraform directory, you can skip this section and move on to the Terraform section.

Packer

  1. Make the changes to the relevant file(s); for example, if you wanted to install htop across the entire cluster, you would add it to packer/bootstrap/common/common.sh. If you only want to add it to user-facing instances, you can instead place it in packer/bootstrap/tools/user-tools.sh, which will install it on the ingress and all workers. If you simply want to perform an update to the currently installed software, move on to the next step.
  2. After making your changes, rebuild the image(s). IMPORTANT - OpenStack doesn't seem to provide a timestamp on images and a rebuild won't overwrite the older image, so things can get confusing if you don't delete the current image(s) before rebuilding. I did include a condition in Terraform to choose the most recent image when provisioning, but it is best to delete old images that are no longer needed. Follow the order shown in the diagram in the Build VM Images section, or use the helper script packer/vm-profiles/build-all.sh, which rebuilds the images in the appropriate order. Rebuilding the images also performs a package update, so any pending security and package updates will be applied.
  3. Now that the images are all rebuilt, you can move on to Terraform to provision instances with these images.

Terraform

  1. Make changes to the relevant file(s); for example, to increase the number of 1080ti workers, modify the "1080ti" value in the "name_counts" variable in variables.tf (a sketch of this variable follows this list).
  2. After making any changes, you can provision the cluster with terraform apply -var-file="variables.tfvars". Terraform will compare the currently deployed cluster against your local profiles to find any changes, and redeploy the relevant instance(s) if it finds any. IMPORTANT - If a change is made to a template (under terraform/templates), Terraform may not detect it, since templates are injected as "user data" (the cloud-config that performs first-boot setup). In that case, delete the affected instance(s) first and then provision fresh ones.
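
As a hedged sketch of what that variable might look like (the authoritative definition lives in terraform/variables.tf and may differ in keys and structure):

# Illustrative sketch only -- check terraform/variables.tf for the real definition.
variable "name_counts" {
  type = map(number)
  default = {
    "1080ti"       = 4   # number of 1080ti GPU workers to provision
    "worker-nogpu" = 8   # number of CPU-only workers
  }
}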

How to access LDAP interface

Illume v2 uses phpLDAPadmin as an interface over openLDAP. To access the web interface for easy account management:

  1. Create (if it doesn't already exist) ~/.ssh/config
  2. Create an entry for the bastion host that looks something like this with the public IP and path to key filled in:
Host bastion                                                                                          
     HostName xxx.xxx.xxx.xxx (public IP)                                                                            
     User ubuntu                                                                                         
     IdentityFile /path/to/key
  3. Save the file, then test that it works by running ssh bastion
  4. Now that the bastion connection works, create a second entry in ~/.ssh/config like this:
Host phpLDAPadmin                                                                                     
     User ubuntu                                                                                         
     HostName xxx.xxx.xxx.xxx (fixed IP)                                                                             
     IdentityFile /path/to/key                                                                      
     ProxyJump bastion                                                                                
     LocalForward 8080 localhost:80

Since the LDAP server and php interface are hosted internally only, we must forward port 80 and then connect via the bastion as that is the only way into the network from the outside (aside from the ingress). ProxyJump will perform this intermediate connection.

  5. Run ssh phpLDAPadmin. Once you have successfully connected to the phpLDAPadmin instance, open your web browser and go to
http://localhost:8080/phpldapadmin/

If everything was done correctly, you should land on the phpLDAPadmin login page.

How to debug LDAP

LDAP is one of the more complicated parts of the cluster. To check that it is working, you can ssh into the openLDAP instance (via the bastion, since it isn't exposed to the internet) and run

ldapsearch -x -b "cn=First Last,ou=users,dc=illume,dc=systems"

where First Last is the user's full name. The full DN can also be retrieved from phpLDAPadmin's web interface by choosing a user and clicking Show internal attributes.

If the LDAP server is successfully running, you should see output like

# extended LDIF
#
# LDAPv3
# base <cn=First Last,ou=users,dc=illume,dc=systems> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# First Last, users, illume.systems
dn: cn=First Last,ou=users,dc=illume,dc=systems
cn: First Last
givenName: First
gidNumber: 501
homeDirectory: /home/users/flast
sn: Last
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: top
uidNumber: 1050
uid: flast
loginShell: /bin/bash

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

How to deploy a test cluster

  1. Clone the repository into a second directory named illume-v2-testing to differentiate it from the production copy
  2. Follow the steps in Prerequisites and Deploying to OpenStack BUT DON'T PROVISION IT YET
  3. Once the repo is populated with OpenStack credentials and initialized for Terraform, open terraform/variables.tfvars and set testing to true. Then set local_subnet to the illume-v1 subnet (to keep the test cluster isolated from the production one). The testing variable appends -TESTING to the instance names and switches the security groups to the illume-v1 variants (a sketch of these settings follows this list)
  4. While still looking at terraform/variables.tfvars, set the appropriate number of worker instances at the bottom. Currently we want all GPUs (except for 1; see note in Build VM images) to be dedicated to production use, so you can likely only enable CPU-only configurations
  5. Ensure you are in the terraform directory, then run terraform apply -var-file="variables.tfvars" to apply your configuration, which should deploy the test cluster without touching the production one. Verify that everything went as anticipated in the Cirrus control panel
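
As a hedged sketch of the variables.tfvars settings described in the steps above (the subnet value and worker keys are placeholders):

# Illustrative sketch only -- values are placeholders.
testing      = true                # appends -TESTING to instance names, switches to illume-v1 secgroups
local_subnet = "illume-v1-subnet"  # the illume-v1 subnet (use whatever identifier variables.tf expects)

name_counts = {
  "1080ti"       = 0   # leave the GPUs to production (see note in Build VM Images)
  "worker-nogpu" = 2   # CPU-only test workers
}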

Authors and acknowledgements

Thanks to Claudio Kopper and David Schultz for mentoring and helping me - without them, this would not have been possible.