gpu-monitor

This is a tool for monitoring GPU usage on the various GPU servers at the LIP6 Lab, Paris.

IPC-Lab Server Cluster Management

Design

The server cluster for the lab is designed so that it can easily be extended with new machines. At the head is Deniz Power, which hosts the dashboard showing the usage of each machine in the cluster. The cluster is made up of an ever-increasing number of machines, and below I detail how each machine reports its usage statistics to Deniz Power and how a new machine can be added to the cluster. It may not be the most elegant solution in terms of resource allocation (a Kubernetes cluster, for example, would be more efficient), but given the background knowledge of everyone in the lab and the effort required for maintenance, this is the solution we have landed on. Future server managers are of course free to change this setup, ideally without too much disruption.

An important detail, as you may have noticed, is that each machine is named after a "monster" from a specific culture. The idea is that each new student is given an opportunity to name a machine (when we get new machines, of course) and picks a name consistent with this naming scheme.

We also maintain that sudo access is reserved for those who manage the cluster. We do not want users to update or modify system-level software, because doing so can have unintended consequences for the stability or usability of the system for other users. Instead, we maintain a bare-minimum suite of software at the system level and ask that each user install the software they require in their own userspace (e.g., via conda virtual environments). If specific software needs to be installed at the system level, the user should contact the server managers, who should evaluate whether it is indeed necessary and whether it would affect the system for other users before proceeding.

How to set up

New server software installation

Each server requires the following elements to be determined first before installation:

  1. Operating system. I recommend using the latest long-term support (LTS) version of Ubuntu Server (no GUI). It offers both stability and ease of maintenance, and the packages are pretty stable. Arch Linux is also a good choice. It offers rolling releases, so there is no need to upgrade the underlying OS version over time, but the packages are newer and may be less stable, and it is harder to install.
  2. Hostname. As mentioned earlier, the hostname of the server should follow the naming scheme of "monster"-based names. The definition of what is considered a monster is fairly loose. If, for example, you find Barney the purple dinosaur to be scary then you are free to name it as such, but be a bit more creative with it please. The name needs to be determined before the OS is installed because the installer asks for it. It can always be changed after installation, but that is more difficult.

It is recommended that when a new server arrives, the server manager first boots it into the preinstalled MS Windows OS and registers the computer on the college network. This ensures that the computer has internet access for downloading packages during the OS installation. The college requires any wired connection to the internet to be registered via a web browser, which is why this must be done before installing the OS.

Once the computer is registered and has access to the internet, follow these steps:

  1. Boot the computer into the UEFI BIOS and deactivate secure boot. It interferes with Nvidia driver installation later on.
  2. Burn the OS image onto a flash drive and boot from it. Follow the installation procedure and enter the chosen hostname when prompted to name the computer. If installing Ubuntu server, there will be an option to install openssh with the OS. Make sure you tick that box as it is needed to access the machine via ssh. For other OSes, make sure you install openssh at some point.
  3. Once the installation is completed, boot the machine into the new OS and run a full system update. The machine should now be accessible via its IP address.
  4. Install Nvidia CUDA and cuDNN. The instructions can be found by searching on DuckDuckGo but I will provide the link to CUDA 11.7 here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html. The link changes for each update so just search for the latest one when you're installing. Typically for Ubuntu servers, the process involves getting the GPG keys for the repository where Nvidia hosts its drivers and then running the apt install command.
  5. Once CUDA and cuDNN are installed, run nvidia-smi to check that the GPUs on the system can be found. If so then you have successfully installed all the required software. If not, enjoy debugging :)
  6. Now move the machine to its final installation location (somewhere on the rack) and note down the MAC address and the serial number of the server. The MAC address can be queried by typing ip a into the terminal and checking the correct ethernet device port. It will be the one with an assigned IP address. The serial number is usually on a sticker at the back of the machine.
  7. With the above information, submit a ticket on ASK ICT to request that a new computer be registered on the network with the chosen hostname for DNS lookup. This will allow users to access the machine without typing in the IP address.
  8. Once ICT confirms this has been completed, you can access the machine using the chosen hostname. The only part left to do is to set up the GPU monitoring software...
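The nvidia-smi check in step 5 can also be scripted. The following is a minimal sketch, not part of this repository: the function names parse_gpu_csv and list_gpus are made up for illustration, and it assumes that nvidia-smi is on the PATH and supports its --query-gpu and --format=csv,noheader options.

```python
import csv
import io
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse "name, memory.total" rows from nvidia-smi CSV output into dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {"name": row[0].strip(), "memory_total": row[1].strip()}
        for row in rows
        if len(row) >= 2  # skip blank or malformed lines
    ]


def list_gpus():
    """Return the GPUs nvidia-smi reports, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)
```

An empty list from list_gpus() means either that the driver is not installed or that it cannot see any GPU, which is the "enjoy debugging" case from step 5.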

Monitoring setup

First, a warning: this code was copied from another repository and, as the author of that repo notes, it was written with the "quickest and dirtiest" principle in mind. In their words, it is absolutely awful, please do not read it 😣

The principle is as follows. A bunch of Bash / Python scripts regularly run nvidia-smi and ps to extract data and send it to the public_html space on the web server. Each time someone wants to see the status of the GPUs, the page index.php reads the latest data files for each server and displays them.

Put the files that are in the scripts folder on the machines you want to monitor. The scripts are as follows:

  • gpu-run.sh <task_id> loops on one of the three tasks (task_id being 1, 2 or 3). Task 1 extracts GPU usage stats every 20 s, task 2 extracts GPU processes every 20 s, and task 3 extracts the ps info that corresponds to GPU processes every 10 s and copies all the monitoring files to the public_html space. This script uses the HOST env variable.
  • gpu-processes.py is what is run by task 3.
  • gpu-check.sh <hostname> checks whether the 3 tasks are running and, if not, launches them in the background. Also, gpu-check.sh kill stops the tasks if they are running.
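The looping behaviour of the tasks can be pictured with a short Python sketch. This is not the code in gpu-run.sh, which is a Bash script. The function poll and its parameters are hypothetical, and the assumption that each sample is appended to a file is only inferred from the cron job further down that restarts the monitors to clean up their output files.

```python
import time
from pathlib import Path


def poll(sample, out_path, interval_s, iterations, sleep=time.sleep):
    """Call sample() `iterations` times, appending each result to out_path.

    `sample` is any zero-argument callable that returns a string, for
    example a wrapper around nvidia-smi. `sleep` is injectable so the
    loop can be exercised without actually waiting.
    """
    out_path = Path(out_path)
    for i in range(iterations):
        # Append one sample per line so the file holds the history so far.
        with out_path.open("a", encoding="utf-8") as handle:
            handle.write(sample().rstrip("\n") + "\n")
        if i < iterations - 1:
            sleep(interval_s)
```

With interval_s set to 20, this mirrors the cadence described for tasks 1 and 2. A file that is only ever appended to keeps growing, which would explain why a periodic restart is useful.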

Edit the following files:

  1. gpu-check.sh: change the directory /home/USER/gpu-monitor/scripts/gpu-run.sh to where you've put the script. Remember to change all 3 instances of this.
  2. gpu-run.sh: change the directory /home/USER/gpu-monitor/scripts/gpu-processes.py to where you've put the python file. There is only one instance of this.

Next, set up SSH keys between the host server (the server to be monitored) and the website server (the server hosting the website). Follow the instructions here: https://www.redhat.com/sysadmin/passwordless-ssh. Remember not to set a passphrase for the key. This allows password-less copying between the machines, which is how the scripts get the stats copied to Deniz Power.

Add the following cron jobs to your user ("crontab -e" then copy the following bit into the file):

# Edit the full-caps placeholders below
# Check every 5 minutes that the monitoring is running
*/5 * * * * /SCRIPT-LOCATION/gpu-check.sh HOSTNAME HTTP_SERVER: > /dev/null 2>&1
# Kill and restart the monitoring every 2 hours to clean up the output files of the monitors
0 */2 * * * /SCRIPT-LOCATION/gpu-check.sh kill > /dev/null 2>&1; /SCRIPT-LOCATION/gpu-check.sh HOSTNAME HTTP_SERVER: > /dev/null 2>&1
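If you set up several machines, filling in the placeholders by hand is error-prone. The following is a small sketch of a helper that renders the two lines above. The function crontab_entries and the example values are hypothetical and not part of the repository. The helper also adds the trailing colon after the web server name if it is missing.

```python
def crontab_entries(script_dir, hostname, http_server):
    """Build the two user-crontab lines for one monitored machine."""
    # gpu-check.sh expects the web server name to end with a colon.
    target = http_server if http_server.endswith(":") else http_server + ":"
    check = f"{script_dir.rstrip('/')}/gpu-check.sh"
    start = f"{check} {hostname} {target} > /dev/null 2>&1"
    return [
        # Every 5 minutes: (re)launch the monitoring tasks if they are not running.
        f"*/5 * * * * {start}",
        # Every 2 hours: kill the tasks, then start them again.
        f"0 */2 * * * {check} kill > /dev/null 2>&1; {start}",
    ]
```

For example, print("\n".join(crontab_entries("/home/USER/gpu-monitor/scripts", "nian", "webhost"))) produces two lines ready to paste into crontab -e, where webhost stands in for the real web server name.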

For disk usage, also add this to the root crontab ("sudo crontab -e"):

*/10 * * * * du -sh /home/* > /tmp/local-usage.txt
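The resulting /tmp/local-usage.txt contains one line per home directory, with the size and the path separated by a tab, as produced by du -sh. As a rough sketch of how such a file could be read (the function parse_du_output is hypothetical, and this is not how index.php does it):

```python
def parse_du_output(text):
    """Turn du -sh output (size, tab, path per line) into (user, size) pairs."""
    usage = []
    for line in text.splitlines():
        # Split on the first run of whitespace; du separates fields with a tab.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        size, path = parts
        # The last path component of /home/<user> is the user name.
        user = path.strip().rstrip("/").rsplit("/", 1)[-1]
        usage.append((user, size))
    return usage
```

The sizes are left as the human-readable strings that du -sh prints (such as 12G), so they are suitable for display but not for numeric sorting.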

To get things running, run gpu-check.sh <hostname> <http servername>:. Don't forget the colon at the end.

If you're unsure about the exact details, check one of the machines that is already set up and extrapolate.

Web interface setup

To set up the web interface, you just need to put the files of the repo (except the scripts folder) in the www space of a web server that supports PHP.

Then edit the index.php file to set the $HOSTS variable and, optionally, the $SHORT_GPU_NAMES variable.

$HOSTS associates the hostnames with some viewable names for these hosts. The keys are the ones entered as HOSTNAME in the crontab above and the <hostname> parameter of gpu-check.

$SHORT_GPU_NAMES allows you to rewrite GPU names if you want. It associates the names given by nvidia-smi to the names you want to be displayed.
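Both variables are plain lookup tables. The Python analogue below only illustrates the idea and is not the PHP code in index.php: the dictionary contents are made-up examples, and falling back to the raw name when no entry exists is an assumption about reasonable behaviour rather than something the original page is known to do.

```python
# Hypothetical example entries; the real values live in index.php.
HOSTS = {
    "tepegoz": "Tepegoz",
    "nian": "Nian",
}

SHORT_GPU_NAMES = {
    "NVIDIA GeForce RTX 3090": "RTX 3090",
}


def display_host(hostname):
    """Return the viewable name for a hostname, or the hostname itself."""
    return HOSTS.get(hostname, hostname)


def display_gpu(name):
    """Return the short GPU name, or the nvidia-smi name if none is set."""
    return SHORT_GPU_NAMES.get(name, name)
```

The keys of HOSTS must match the HOSTNAME passed to gpu-check.sh exactly, and the keys of SHORT_GPU_NAMES must match what nvidia-smi prints on that machine.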

Adding new users

To add new users, check the corresponding OS instructions for doing so. For Ubuntu, it is sudo adduser <username>. For Arch Linux, it is sudo useradd -m <username>, followed by sudo passwd <username> to set the password. By default, each new account will have the password changeme.

The only exceptions are Tepegoz, Nian and Bubota. Those machines were set up a long time ago and use the college's LDAP for account sign-in. This means that each user only needs their college login username and password to access the server. To grant access to a new user, edit /etc/group using vim or nano (for example, sudo vim /etc/group) and add the username to the ssh group.
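Each line of /etc/group has four colon-separated fields: group name, password placeholder, GID, and a comma-separated member list. The sketch below shows what the edit amounts to for a single line. The function add_user_to_group_line is hypothetical, it works on text only, and it does not touch the real file. The GID in the example is made up.

```python
def add_user_to_group_line(line, username):
    """Return an /etc/group line with username appended to its member list."""
    name, password, gid, members = line.rstrip("\n").split(":", 3)
    current = [member for member in members.split(",") if member]
    if username not in current:
        current.append(username)  # leave the line unchanged if already a member
    return ":".join([name, password, gid, ",".join(current)])
```

For example, add_user_to_group_line("ssh:x:115:alice", "bob") gives ssh:x:115:alice,bob. Note that members are separated by commas with no spaces.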