triton/tut/slurm: how does priority work?
rkdarst opened this issue · 1 comment
rkdarst commented
From an interactive question: how is priority calculated? We shouldn't go into depth, but it could be mentioned in 1-2 more sentences.
This bit of history is an old write-up about it, which was deprecated some time ago because the page was redundant (the new description shouldn't be this long, but it could be a FAQ?):
scicomp-docs/triton/usage/jobs.rst.old
Lines 189 to 273 in af73a37
Job priority
============
Triton queues are not first-in first-out, but "fairshare". This means
that every person has a priority. The more you run, the lower your
user priority. As time passes, your user priority increases again.
The longer a job waits in the queue, the higher its job priority goes.
So, in the long run (if everyone is submitting a never-ending stream
of jobs), everyone will get exactly their share.
Once priorities are assigned, jobs are scheduled in order of
priority, and then any gaps are backfilled with smaller jobs that can
fit in. So small jobs usually get scheduled fast regardless.
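
To make the scheduling order concrete, here is a toy sketch (in
Python, and emphatically not Slurm's real algorithm): jobs are taken
in priority order, and lower-priority jobs that fit into the leftover
resources are started while the big one waits.

.. code-block:: python

   # Toy model of priority scheduling plus backfill -- illustration
   # only, not Slurm's real algorithm. Each job needs some CPUs.
   free_cpus = 8

   queue = [
       {"name": "huge",   "priority": 900, "cpus": 16},  # can't fit yet
       {"name": "medium", "priority": 500, "cpus": 6},
       {"name": "small",  "priority": 100, "cpus": 2},
   ]

   # 1. Consider jobs in order of priority, highest first.
   queue.sort(key=lambda j: j["priority"], reverse=True)

   started = []
   for job in queue:
       if job["cpus"] <= free_cpus:
           # 2. Backfill: smaller jobs that fit in the leftover CPUs
           #    start right away, even while "huge" keeps waiting.
           free_cpus -= job["cpus"]
           started.append(job["name"])

   print(started)    # ['medium', 'small']
   print(free_cpus)  # 0 -- 'huge' waits until enough CPUs free up

Real backfill additionally uses the requested time limits: a job is
only backfilled if it will not delay the expected start of the
higher-priority jobs, which is one reason accurate time requests help.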
*Warning: from this point on, things get more and more technical, in
case you really want to know the details. Summary at the end.*
What's a share? Currently shares are based on departments and their
respective funding of Triton (``sshare``). Shares are shared among
everyone in the department, but each person has their own priority.
Thus, for medium users, the 2-week usage of the rest of your
department can affect how fast your jobs run. However, again, things
are balanced per-user within departments. (One heavy user in a
department can still affect all others in that department a bit too
much; we are working on this.)
Your priority goes down via the "job billing": roughly time×power.
CPUs are billed at 1/s (but older, less powerful CPUs cost less!).
Memory costs 0.2/GB/s. But: you only get billed for the max of memory
or CPU. So if you use one CPU and all the memory (so that no one else
can run on that node), you get billed for all the memory but no CPU.
The same goes for all CPUs and little memory. This encourages
balanced use. (This also applies to GPUs.)
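
As a rough illustration of that rule (using the example rates quoted
above, 1/s per CPU and 0.2/GB/s for memory, which may not match the
current configuration):

.. code-block:: python

   # Sketch of the "billed for the max of CPU or memory" rule.
   # The rates are the example values from the text above and may
   # differ from the weights currently set in slurm.conf.
   CPU_RATE = 1.0   # billing units per CPU-second
   MEM_RATE = 0.2   # billing units per GB-second

   def job_billing(cpus, mem_gb, runtime_s):
       """Billed usage for one job: whichever resource dominates."""
       rate = max(cpus * CPU_RATE, mem_gb * MEM_RATE)
       return rate * runtime_s

   hour = 3600
   print(job_billing(cpus=1,  mem_gb=250, runtime_s=hour))  # 180000.0, pays for memory
   print(job_billing(cpus=40, mem_gb=8,   runtime_s=hour))  # 144000.0, pays for CPUs
   print(job_billing(cpus=10, mem_gb=50,  runtime_s=hour))  # 36000.0, balanced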
GPUs also have a billing weight, currently tens of times higher than
the CPU billing weight for the newest GPUs. (In general all of these
can change; for the latest info, search for ``BillingWeights`` in
``/etc/slurm/slurm.conf``.)
If you submit a long job but it ends early, you are only billed for
the time you actually use (though the longer request might make the
job take longer to start). Memory is always billed for the full
reservation even if you use less, since it isn't shared.
The "user priority" is actually just a record how much you have | |
consumed lately (the billing numbers above). This number goes down | |
with a half-life decay of 2 weeks. Your personal priority your share | |
compared to that, so we get the effect described above: the more you | |
(or your department) runs lately, the lower your priority. | |
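
A minimal sketch of that two-week half-life, just to show how quickly
old usage fades (the exact fair-share formula Slurm uses is in the
documentation linked below):

.. code-block:: python

   # How much a past job's billing still counts against you today,
   # given the two-week half-life described above.
   HALF_LIFE_DAYS = 14

   def decayed_usage(billing, days_ago):
       return billing * 0.5 ** (days_ago / HALF_LIFE_DAYS)

   print(decayed_usage(100_000, days_ago=0))   # 100000.0, counts fully
   print(decayed_usage(100_000, days_ago=14))  # 50000.0, half forgotten
   print(decayed_usage(100_000, days_ago=28))  # 25000.0, a quarter left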
If you want your stuff to run faster, the best way is to specify your
time more accurately (the job may then find a place in the schedule
sooner) and your memory (which avoids needlessly wasting your
priority).
While your job is pending in the queue, SLURM checks those metrics
regularly and recalculates the job priority. If you are interested
in the details, take a look at the `multifactor priority plugin
<https://slurm.schedmd.com/priority_multifactor.html>`__ page (general
info) and `depth-oblivious fair-share factor
<https://slurm.schedmd.com/priority_multifactor3.html>`__ for what we
use specifically (warning: a very in-depth page). On Triton, you can
always see the latest billing weights in ``/etc/slurm/slurm.conf``.
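
For orientation only: the multifactor plugin combines several
factors, each normalized to the range 0..1, as a weighted sum,
roughly like the sketch below. The weights here are invented for
illustration and are not Triton's actual configuration.

.. code-block:: python

   # Rough shape of the multifactor job priority: a weighted sum of
   # normalized factors. The weights are made-up illustration values;
   # the real ones live in the scheduler configuration.
   weights = {"fairshare": 100000, "age": 10000, "jobsize": 1000}

   def job_priority(fairshare, age, jobsize):
       factors = {"fairshare": fairshare, "age": age, "jobsize": jobsize}
       return sum(weights[k] * factors[k] for k in weights)

   # A light user whose job has waited a while beats a heavy user
   # submitting a brand-new job of the same size.
   print(job_priority(fairshare=0.9, age=0.5, jobsize=0.1))  # 95100.0
   print(job_priority(fairshare=0.2, age=0.0, jobsize=0.1))  # 20100.0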
Numerically, job priorities range from 0 to 2^32-1. Higher runs
sooner, but the number doesn't mean much by itself.
These commands can show you information about your user and job
priorities:
.. csv-table::
   :delim: |

   ``slurm s`` | list of jobs per user with their current priorities
   ``slurm full`` | as above, but almost all of the job parameters are listed
   ``slurm shares`` | displays usage (RawUsage) and current fairshare weight (FairShare, higher is better) values for all users
   ``slurm j <jobid>`` | shows detailed info on ``<jobid>``, including priority, requested nodes, etc.

..
   ``slurm p gpu`` | # shows partition parameters incl. Priority=
tl;dr: Just select the resources you think you need, and slurm
tries to balance things out so everyone gets their share. The best
way to maintain high priority is to use resources efficiently so you
don't need to over-request.
rkdarst commented
The old text was copied here, along with more discussion. This could be linked from somewhere:
https://aaltoscicomp.github.io/blog/2024/how-busy-is-the-cluster/