Ad-hoc report: fox heavy GPU usage
Experimental / speculative.
This comes out of https://gitlab.sigma2.no/naic/wp2/identify-most-resource-intensive-users/-/issues/1. We will try to create an ad-hoc report that:
- takes a time interval of interest (typically some range of dates) and a cluster name as arguments
- finds jobs that ran for at least 24h
- produces a list of jobs (user+job data) on that cluster that used at least one gpu-day over the lifetime of the job
- annotates a job with a mark if the job used at least one gpu-day within a 24h period
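The selection steps above can be sketched roughly as follows. This is only an illustration, not the prototype's code: the `Job` record, its field names, and the `heavy_gpu_jobs` function are hypothetical, and it assumes the GPU time for a job has already been summed over all of its cards.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DAY_SECONDS = 24 * 60 * 60  # one day, and also one gpu-day of GPU time


@dataclass
class Job:
    # Hypothetical job record; the real job data may look different.
    user: str
    cluster: str
    host: str
    command: str
    start: datetime
    end: datetime
    gpu_seconds: float  # GPU time summed over all cards the job used

    @property
    def duration_seconds(self) -> float:
        return (self.end - self.start).total_seconds()


def heavy_gpu_jobs(jobs, cluster, first, last):
    """Return jobs on `cluster` overlapping [first, last] that ran for at
    least 24h and used at least one gpu-day over their lifetime."""
    selected = []
    for job in jobs:
        if job.cluster != cluster:
            continue
        if job.end < first or job.start > last:
            continue  # no overlap with the interval of interest
        if job.duration_seconds < DAY_SECONDS:
            continue  # ran for less than 24h
        if job.gpu_seconds < DAY_SECONDS:
            continue  # used less than one gpu-day in total
        selected.append(job)
    return selected


# Usage with made-up jobs: only the first one passes every filter.
t0 = datetime(2023, 9, 1)
example_jobs = [
    Job("user-a", "fox", "gpu-1.fox", "python3", t0, t0 + timedelta(hours=30), 100000.0),
    Job("user-b", "fox", "gpu-2.fox", "python3", t0, t0 + timedelta(hours=10), 100000.0),
    Job("user-c", "fox", "gpu-3.fox", "python3", t0, t0 + timedelta(hours=30), 1000.0),
]
example_result = heavy_gpu_jobs(example_jobs, "fox", t0, t0 + timedelta(days=7))
```

Whether a job that only partly overlaps the interval should be included is a design choice; the sketch includes it.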
Soft dependencies (we can do without them for now):
The output from the prototype report looks like this (manually reformatted a little b/c good formatting is not currently implemented):
>24h  User          GpuTime   GpuTime/  Host(s)     Command
                              duration
      ec-lgcharpe   438260s   97%       gpu-1.fox   python3,wandb-service(2
      ec-nicoca     131084s   73%       gpu-7.fox   python3
      ec-nicoca     310050s   73%       gpu-7.fox   python3
      ec-nicoca     321326s   75%       gpu-1.fox   python3
*     ec-thallesss  954508s   353%      gpu-8.fox   python,python_<defunct>,torchrun
*     ec-thallesss  1495751s  349%      gpu-2.fox   python,python_<defunct>,torchrun
      ec-dhananjt   115676s   27%       gpu-12.fox  python
      ec-nicoca     317770s   73%       gpu-7.fox   python3
      ec-abgani     371619s   86%       gpu-7.fox   2_run.sh,pmemd.cuda,pmemd.cuda_<defunct>,slurm_script
      ec-nicoca     165963s   73%       gpu-7.fox   python3
*     ec-thallesss  1132497s  353%      gpu-8.fox   python,torchrun
*     ec-thallesss  583203s   358%      gpu-2.fox   python,torchrun
The mark in the left column indicates that the job used at least one full GPU for at least 24h of running. Note that it would also be marked if it used two GPUs at 60% each, say, for the same period (120% total).
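One way to decide the mark, as a rough sketch only: slide a 24h window over the job's GPU usage samples and check whether the GPU time inside any window reaches one gpu-day. The function name and the input format (one sample per fixed time step, giving GPU utilization in percent summed over all cards) are assumptions made for illustration, not how the prototype actually works.

```python
from collections import deque

DAY_SECONDS = 24 * 60 * 60


def used_gpu_day_in_24h(gpu_percent_samples, step_seconds):
    """Return True if some window of at most 24h contains >= one gpu-day.

    gpu_percent_samples: GPU utilization in percent, summed over all cards,
    one value per sampling step (assumed input format).
    step_seconds: length of one sampling step in seconds.
    """
    window_len = DAY_SECONDS // step_seconds  # samples that fit in 24h
    window = deque()
    total = 0.0  # GPU-seconds inside the current window
    for pct in gpu_percent_samples:
        gpu_seconds = pct * step_seconds / 100
        window.append(gpu_seconds)
        total += gpu_seconds
        if len(window) > window_len:
            total -= window.popleft()  # drop the sample older than 24h
        if total >= DAY_SECONDS:
            return True
    return False


# Hourly samples: one GPU at 100% for 24h, and two GPUs at 60% each (120%).
one_full_gpu = used_gpu_day_in_24h([100] * 24, 3600)
two_gpus_at_60 = used_gpu_day_in_24h([120] * 24, 3600)
```

Both example calls return True, while a job that stays at 97% of a single GPU never reaches one gpu-day inside any 24h window and would stay unmarked.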
I guess the time could usefully be shown as dd:hh:mm:ss. 115676s (the shortest one) is 32.1 hours. 1495751s (the longest one) is 415.5 hours (divided by four cards, so over 100 hours of real time).
Plenty of other information is available about these jobs, if of interest.
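A minimal sketch of the dd:hh:mm:ss conversion, applied to the shortest and longest GPU times in the table. The helper name `format_ddhhmmss` is made up for this example.

```python
def format_ddhhmmss(total_seconds):
    """Format a number of seconds as dd:hh:mm:ss."""
    days, rem = divmod(int(total_seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{seconds:02d}"


shortest = format_ddhhmmss(115676)   # "01:08:07:56", about 32.1 hours
longest = format_ddhhmmss(1495751)   # "17:07:29:11", about 415.5 hours

# Spread evenly over four cards, the longest job is about 103.9 hours of real time.
real_hours = 1495751 / 4 / 3600
```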