/SlurmMonitor.jl

A monitor for SLURM based HPC systems that (without sudo/root) tests if nodes are healthy, and notifies slack on failure. Mirror of https://github.com/bencardoen/SlurmMonitor.jl

Primary LanguageJuliaGNU Affero General Public License v3.0AGPL-3.0

SlurmMonitor

DOI

SlurmMonitor monitors SLURM (an HPC scheduler) based clusters for status, records the data over time, and if configured can act on predefined conditions.

Linking to Slack

** You need admin rights to do this, and do not create public endpoints without realizing what they (can) do**

  • Login to Slack
  • Settings and Admin
  • "Manage Apps"
  • "Build"
  • Create a new App
  • Activate new webhook

Test if link works

curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' $URL

Installation

You install the monitor on a login node, and this assumes HPC admins are ok with you doing this.

git clone <thisrepo>
cd SlurmMonitor.jl

Then start julia

julia
julia> using Pkg; Pkg.add(".")

or

julia

then

julia>using Pkg; Pkg.activate() # Activate env in current dir, optional
julia>using Pkg; Pkg.add(url=<thisrepo>)

Test integration with slack

julia --project=.  # assuming you're in the cloned directory

Then

using SlurmMonitor
endpoint=readendpoint("endpoint.txt")
posttoslack("42 is the answer", endpoint)

That either posts the message, or tells you why it couldn't. make sure the format of the url is /services/.../.../.. See slack app configuration page on how to fix this if invalid.

Usage

The monitor polls at intervals i, repeating r times, with minimum acceptable latency l and saving to output dir o. Triggers (node going down, latency spikes), trigger optional messages to Slack e. It needs and endpoint file (1 line), with a endpoint (see earlier). You'd use this within a tmux/screen session to keep it in the background.

Example

Every minute, for 1e4 minutes, run the monitor, and call Solar Slack if issues arise.

julia --project=. src/monitor.jl -i 60 -r 10000 -o . -e endpoint_solar.txt -l 40

This will save a csv file, every z seconds, for k iterations, where 1 line represents the state of each node in the cluster, recording total/free CPU/RAM/GPU and node status (IDLE, ALLOC, ...).

On specified conditions (IDLE->DOWN) will send messages to a linked Slackbot, configured with the right endpoint.

If a node is not responsive (by network), a similar trigger is fired. Define the mininum average latency you consider as not-reachable in CLI.

Output

Saved to observed_state.csv. Do Not move the csv file, it's continuously read/written to See src/SlurmMonitor.jl, e.g. summarizestate($DATAFRAME, $ENDPOINT).

using Pkg
Pkg.activate(".")
using DataFrames
using CSV
df = CSV.read("where.csv", DataFrame)
endpoint = readendpoint("whereendpointis.txt")
summarizestate(df, endpoint) ## Sends to slack
plotstats(df)  ## Plots in svg

Dependencies

Warning

If you run this on a cluster, make sure you're authorized to do so. Calling scontrol and sinfo are RPC calls that cause a non-trivial load on the scheduler, if the cluster has 1000s of nodes, and you set the interval to 1s, that means 2000 RPC calls/1. Note that it takes several seconds, if not more, for a node to change state anyway. Do not do this unless you're a cluster admin. Sane intervals are ~ 60-120 or more seconds.

Extra functionality

  • Triggers can be anything, currently node state and latency are used
  • Diskusage, nvidia drivers, etc are all implemented, not active (can trigger ssh lockout)
  • Contact me if you need those active

Troublehshooting

Times seem wrong

Times are recorded in UTC. If you want this differently, it's not hard, I'd happily accept a properly documented PR.

Cite

If you find this useful, please cite

@software{ben_cardoen_2022_7106106,
  author       = {Ben Cardoen},
  title        = {{SlurmMonitor.jl: A Slurm monitoring tool that
                   notifies slack on adverse SLURM HPC state changes
                   and records temporal statistics on utilization.}},
  month        = sep,
  year         = 2022,
  note         = {https://github.com/bencardoen/SlurmMonitor.jl},
  publisher    = {Zenodo},
  version      = {0.1.0},
  doi          = {10.5281/zenodo.7106106},
  url          = {https://doi.org/10.5281/zenodo.7106106}
}