Vagrant-based system to spawn a ready-to-use multi-node Kubernetes/Slurm VM cluster, usable for testing, learning, and development.
The only prerequisites are basic virtualization tools, in particular Vagrant and its dependencies. Virtualization is done with `libvirt` and `qemu-kvm` on Linux systems, but it should be possible (although not tested) to use other providers such as VirtualBox or VMware.
On a RHEL-based system you can install them with:
```bash
sudo dnf install -y $(sed -r '/^#/d' requirements.txt)
```
and enable the virtualization services:
```bash
sudo systemctl enable --now libvirtd
sudo usermod -aG libvirt $(whoami)
```

and install the `vagrant-libvirt` provider plugin:

```bash
vagrant plugin install vagrant-libvirt
```
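Note that the group change from `usermod` only takes effect after you log out and back in. A quick way to verify that everything is in place (these are standard checks, not part of the repository's scripts):

```bash
systemctl is-active libvirtd   # should print "active"
vagrant plugin list            # should list vagrant-libvirt
```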
Although it's not strictly necessary, I highly recommend installing the `vagrant-scp` plugin to easily copy files to/from the VMs:
```bash
vagrant plugin install vagrant-scp
```
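For example, once the VMs are up you can copy a local file to the master node (the file name here is just a placeholder):

```bash
vagrant scp ./some-local-file kube-00:/home/vagrant/
```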
The VMs are defined in the `Vagrantfile` and can be customized to your needs.
- By default the cluster is composed of 1 master node (`kube-00`) and 2 worker nodes.
- The VMs are based on Fedora cloud images.
- `k8s` is used as the Kubernetes cluster manager.
- The default user is `vagrant` with password `vagrant`.
- In the home of the `vagrant` user, the `shared` directory is mounted as a folder shared among the nodes.
- Feel free to change the number of nodes and resources, but be aware that:
  - To run Kubernetes the minimum requirements are 2 GB of RAM and 2 CPUs.
  - You need to manually adjust the files `slurm_install.sh` and `slurm_worker.sh` to reflect the number of nodes and their names. This should be done both at the beginning of the file, where the node names are defined in `/etc/hosts`, and at the end, where the `slurm.conf` file is created (a quick way to locate these spots is sketched right after this list).
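As a rough aid (the exact patterns are an assumption and depend on the scripts' contents), something like the following can help you spot the lines that hard-code node names and the `slurm.conf` template:

```bash
grep -nE 'kube-|NodeName|PartitionName' slurm_install.sh slurm_worker.sh
```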
To start the VMs, run:
```bash
sudo virsh net-define scripts/kube-net.xml
sudo virsh net-start kube-net
vagrant up --provider=libvirt --no-parallel
```
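Once provisioning finishes, you can check that the network and all the VMs are up:

```bash
sudo virsh net-list   # kube-net should be listed as active
vagrant status        # all nodes should be "running"
```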
Once the VMs are up and running, you can access them with:
```bash
vagrant ssh kube-00
```
Note that `vagrant` takes care of the SSH keys and IP addresses of the VMs. However, if you need to access the VMs directly (e.g. to use the SSH extension for VS Code), you can use the SSH keys available in the `ssh` directory.
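For example, `vagrant ssh-config` prints the hostname, port, user, and private-key path that Vagrant uses for each VM; appending it to your `~/.ssh/config` makes the nodes visible to the VS Code SSH extension:

```bash
vagrant ssh-config kube-00 >> ~/.ssh/config
```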
You can control the cluster using the `kubectl` command. You don't need to SSH into the master node every time: you can use the kubeconfig file available in the `vagrant` user's home directory. To use it, copy it to your local machine with:
```bash
vagrant scp kube-00:/home/vagrant/.kube/config ./playground_kubeconfig.yaml
```
and then set the `KUBECONFIG` environment variable to point to the file:
```bash
export KUBECONFIG=$(pwd)/playground_kubeconfig.yaml
```
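You can then verify that the cluster is reachable from your local machine:

```bash
kubectl get nodes   # nodes will report NotReady until a CNI is installed (see below)
```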
By default no Container Network Interface (CNI) plugin is installed. You can choose between `flannel` and `calico` by running (after SSHing into the master node):
```bash
/home/vagrant/deploy_calico.sh
```

or

```bash
/home/vagrant/deploy_flannel.sh
```

Both scripts are already pre-configured for this cluster.
**Important**: once the CNI is installed (you can check it using `kubectl` or `k9s`), you must restart all the nodes to apply the changes. To do that, just run:
```bash
vagrant reload
```
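After the reload, the CNI pods should be running and the nodes should turn `Ready` (the exact pod and namespace names depend on the CNI you chose):

```bash
kubectl get pods -A | grep -iE 'flannel|calico'
kubectl get nodes
```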
To switch between CNIs or to uninstall the current one, you can run:

```bash
helm uninstall flannel --namespace kube-flannel
```

or

```bash
kubectl delete -f /home/vagrant/calico.yaml
```

and then restart the nodes.
The cluster is also configured to run Slurm, a job scheduler and resource manager for HPC systems. All the nodes belong to a single `debug` partition.
**Remark**: at the moment, due to some issues, Slurm only works for the `root` user. Enabling it for non-root users is a future TODO. Since this environment is meant for testing and learning purposes, this limitation should not be a big deal.
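As a quick smoke test, run something like the following from the master node as `root` (assuming the Slurm controller runs there):

```bash
sudo sinfo                 # the debug partition with all nodes should be listed
sudo srun -N 2 hostname    # run a trivial job across 2 nodes
```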
- [ ] Optimize the automatic deployment using `Ansible` and `kubespray`
- [ ] Enable Slurm for non-root users
- [ ] Add more CNIs (e.g. `cilium`)