aws-samples/aws-eda-slurm-cluster

[FEATURE] Create partitions with number of cores and amount of memory in name

Closed this issue · 0 comments

Is your feature request related to a problem? Please describe.

Right now the node names contain only the instance type name and don't show the amount of memory or the number of cores. Because the instance type name appears in both the queue name and the compute resource (CR) name, the node name contains duplicate information.
Can we revert to a naming convention that includes the number of cores and the amount of memory?

Relates to #235
Relates to #261

The queues and CRs result in node names that look like the following:

od-r7a-l                 up   infinite     10  idle~ od-r7a-l-dy-od-r7a-l-[1-10]
od-r7i-l                 up   infinite     10  idle~ od-r7i-l-dy-od-r7i-l-[1-10]
od-r7iz-l                up   infinite     10  idle~ od-r7iz-l-dy-od-r7iz-l-[1-10]
od-16-gb                 up   infinite     30  idle~ od-r7a-l-dy-od-r7a-l-[1-10],od-r7i-l-dy-od-r7i-l-[1-10],od-r7iz-l-dy-od-r7iz-l-[1-10]

Note that the nodename is the concatenation of the queue and CR names.
So, there is an opportunity to encode the instance type attributes (cores and memory) in one of those names.
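
As a rough illustration of that concatenation, here is a minimal Python sketch that rebuilds the node ranges shown in the listing above. The helper name `node_range` is hypothetical, and the `dy` infix is assumed to mark dynamic nodes, as in the listing.

```python
def node_range(queue: str, cr: str, count: int, kind: str = "dy") -> str:
    """Return the hostlist expression for one compute resource (CR).

    The pattern <queue>-<kind>-<cr>-[1-<count>] is inferred from the
    sinfo output in this issue; it is an assumption, not a documented API.
    """
    return f"{queue}-{kind}-{cr}-[1-{count}]"


# Current convention: queue and CR carry the same name, so it appears twice.
print(node_range("od-r7a-l", "od-r7a-l", 10))
```

Because the queue and CR names are independent inputs, either one could carry the memory and core information instead of repeating the instance type.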
Since multiple instance types can have the same core/memory configuration, I need to do one of two things:

  1. Create one queue per configuration (e.g. 16-gb-1-cores), with one or more CRs named by instance type.
  2. Create a queue named by instance type and name the CR by configuration. This would only work if CR names can be duplicated across queues.
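
To make the two schemes concrete, here is a small Python sketch that derives the queue and CR names for each option. The instance list and its memory/core values are the ones used in this issue's examples; the function names and the `od` purchase-option prefix are illustrative assumptions, not part of the project's code.

```python
from collections import defaultdict

# (instance type, memory in GB, cores), taken from the examples in this issue.
INSTANCE_TYPES = [("r7a-l", 16, 2), ("r7i-l", 16, 1), ("r7iz-l", 16, 1)]


def config_name(mem_gb: int, cores: int) -> str:
    """Encode memory and cores, e.g. '16-gb-1-cores'."""
    return f"{mem_gb}-gb-{cores}-cores"


def option1(purchase: str = "od") -> dict:
    """One queue per configuration; CRs are named by instance type."""
    queues = defaultdict(list)
    for itype, mem, cores in INSTANCE_TYPES:
        queues[f"{purchase}-{config_name(mem, cores)}"].append(itype)
    return dict(queues)


def option2(purchase: str = "od") -> dict:
    """One queue per instance type; its single CR is named by configuration."""
    return {
        f"{purchase}-{itype}": [config_name(mem, cores)]
        for itype, mem, cores in INSTANCE_TYPES
    }


print(option1())
print(option2())
```

Note that in `option2` the CR name `16-gb-1-cores` is used by two different queues, which is why that option depends on duplicate CR names being allowed.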

I kind of like the second because it allows the most flexibility.
You can easily create new partitions that combine node sets of other partitions, but splitting them up isn't as easy.
That could be mitigated somewhat by creating nodesets that are based on different attributes such as number of cores and amount of memory.
This could also be advantageous: AWS ParallelCluster (APC) doesn't currently support custom features on nodes, so the easiest way to create custom "features" is to use nodesets and partitions.
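
A minimal sketch of that idea, assuming the second option's naming: group the per-instance-type node ranges by memory and by memory plus cores, and emit one combined partition per group. The data, the function name `attribute_partitions`, and the `dy` infix are assumptions for illustration. The `PartitionName=... Nodes=...` lines use standard slurm.conf syntax, but how they would be injected into the cluster configuration is not covered here.

```python
from collections import defaultdict

# (queue, CR, memory in GB, cores, node count); hypothetical example data.
CRS = [
    ("od-r7a-l", "16-gb-2-cores", 16, 2, 10),
    ("od-r7i-l", "16-gb-1-cores", 16, 1, 10),
    ("od-r7iz-l", "16-gb-1-cores", 16, 1, 10),
]


def attribute_partitions(crs, purchase: str = "od") -> dict:
    """Map attribute-based partition names to comma-separated node ranges."""
    parts = defaultdict(list)
    for queue, cr, mem, cores, count in crs:
        nodes = f"{queue}-dy-{cr}-[1-{count}]"
        # Every node range joins a memory-only group and a memory+cores group.
        parts[f"{purchase}-{mem}-gb"].append(nodes)
        parts[f"{purchase}-{mem}-gb-{cores}-cores"].append(nodes)
    return {name: ",".join(ranges) for name, ranges in parts.items()}


for name, nodes in attribute_partitions(CRS).items():
    print(f"PartitionName={name} Nodes={nodes}")
```

Each grouping acts like a custom "feature": a job that needs 16 GB and one core can target `od-16-gb-1-cores` without knowing which instance types provide it.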

Also remove the purchase option (e.g. od) from the CR name, since it's already in the queue name.

Option 1:

od-16-gb-1-cores   up   infinite     20  idle~ od-16-gb-1-cores-dy-r7i-l-[1-10],od-16-gb-1-cores-dy-r7iz-l-[1-10]
od-16-gb-2-cores   up   infinite     10  idle~ od-16-gb-2-cores-dy-r7a-l-[1-10]
od-16-gb           up   infinite     30  idle~ od-16-gb-1-cores-dy-r7i-l-[1-10],od-16-gb-1-cores-dy-r7iz-l-[1-10],od-16-gb-2-cores-dy-r7a-l-[1-10]

Option 2:

od-r7a-l           up   infinite     10  idle~ od-r7a-l-dy-16-gb-2-cores-[1-10]
od-r7i-l           up   infinite     10  idle~ od-r7i-l-dy-16-gb-1-cores-[1-10]
od-r7iz-l          up   infinite     10  idle~ od-r7iz-l-dy-16-gb-1-cores-[1-10]
od-16-gb           up   infinite     30  idle~ od-r7a-l-dy-16-gb-2-cores-[1-10],od-r7i-l-dy-16-gb-1-cores-[1-10],od-r7iz-l-dy-16-gb-1-cores-[1-10]
od-16-gb-1-cores   up   infinite     20  idle~ od-r7i-l-dy-16-gb-1-cores-[1-10],od-r7iz-l-dy-16-gb-1-cores-[1-10]
od-16-gb-2-cores   up   infinite     10  idle~ od-r7a-l-dy-16-gb-2-cores-[1-10]