aws-samples/aws-eda-slurm-cluster

Reducing number of compute resources too aggressively.

gwolski opened this issue · 2 comments

I'm building a cluster with just nine instance types, and certain instance types are being culled to "reduce number of CRs". This is unnecessary, since I do not have many compute resources.

Config file has:

InstanceConfig:
  UseSpot: false
  NodeCounts:
    # @todo: Update the max number of each instance type to configure
    DefaultMaxCount: 10
  Include:
    InstanceTypes:
      - m7a.large
      - m7a.xlarge
      - m7a.2xlarge
      - m7a.4xlarge
      - r7a.large
      - r7a.xlarge
      - r7a.2xlarge
      - r7a.4xlarge
      - r7a.8xlarge

It then buckets appropriately:
INFO: Instance type by memory and core:
INFO: 6 unique memory sizes:
INFO: 8 GB
INFO: 1 instance type with 2 core(s): ['m7a.large']
INFO: 16 GB
INFO: 1 instance type with 2 core(s): ['r7a.large']
INFO: 1 instance type with 4 core(s): ['m7a.xlarge']
INFO: 32 GB
INFO: 1 instance type with 4 core(s): ['r7a.xlarge']
INFO: 1 instance type with 8 core(s): ['m7a.2xlarge']
INFO: 64 GB
INFO: 1 instance type with 8 core(s): ['r7a.2xlarge']
INFO: 1 instance type with 16 core(s): ['m7a.4xlarge']
INFO: 128 GB
INFO: 1 instance type with 16 core(s): ['r7a.4xlarge']
INFO: 256 GB
INFO: 1 instance type with 32 core(s): ['r7a.8xlarge']
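The bucketing shown in that log can be sketched roughly as follows. This is an illustrative sketch, not the tool's actual code: the hardcoded `INSTANCE_SPECS` table and the function name are assumptions (the real tool looks up memory and core counts from EC2).

```python
from collections import defaultdict

# Illustrative (memory GiB, core count) specs for the instance types in the
# config above; the actual tool queries the EC2 API for these values.
INSTANCE_SPECS = {
    "m7a.large":   (8,    2),
    "m7a.xlarge":  (16,   4),
    "m7a.2xlarge": (32,   8),
    "m7a.4xlarge": (64,  16),
    "r7a.large":   (16,   2),
    "r7a.xlarge":  (32,   4),
    "r7a.2xlarge": (64,   8),
    "r7a.4xlarge": (128, 16),
    "r7a.8xlarge": (256, 32),
}

def bucket_by_memory_and_cores(specs):
    """Group instance types first by memory size, then by core count."""
    buckets = defaultdict(lambda: defaultdict(list))
    for instance_type, (mem_gb, cores) in specs.items():
        buckets[mem_gb][cores].append(instance_type)
    return buckets

buckets = bucket_by_memory_and_cores(INSTANCE_SPECS)
for mem_gb in sorted(buckets):
    print(f"{mem_gb} GB")
    for cores in sorted(buckets[mem_gb]):
        print(f"  {cores} core(s): {sorted(buckets[mem_gb][cores])}")
```

Running this reproduces the six memory buckets above, with two core sizes in the 16, 32, and 64 GB buckets.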

But then it starts culling unnecessarily, as ParallelCluster/Slurm can handle nine compute resources...

INFO: Configuring od-8-gb queue:
INFO: Adding od-8gb-2-cores compute resource: ['m7a.large']
INFO: Configuring od-16-gb queue:
INFO: Adding od-16gb-2-cores compute resource: ['r7a.large']
INFO: Skipping od-16gb-4-cores compute resource: ['m7a.xlarge'] to reduce number of CRs.
INFO: Configuring od-32-gb queue:
INFO: Adding od-32gb-4-cores compute resource: ['r7a.xlarge']
INFO: Skipping od-32gb-8-cores compute resource: ['m7a.2xlarge'] to reduce number of CRs.
INFO: Configuring od-64-gb queue:
INFO: Adding od-64gb-8-cores compute resource: ['r7a.2xlarge']
INFO: Skipping od-64gb-16-cores compute resource: ['m7a.4xlarge'] to reduce number of CRs.
INFO: Configuring od-128-gb queue:
INFO: Adding od-128gb-16-cores compute resource: ['r7a.4xlarge']
INFO: Configuring od-256-gb queue:
INFO: Adding od-256gb-32-cores compute resource: ['r7a.8xlarge']
INFO: Created 6 queues with 6 compute resources

I would like to have a 16-core 64 GB machine, an 8-core 32 GB machine, etc. How do I disable or modify this "culling"? I would argue culling should only start when we exceed what ParallelCluster can handle.

We can now have 50 Slurm queues per cluster, 50 compute resources per queue, and 50 compute resources per cluster! See:
https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration-of-multiple-queues-v3.html

I've found that by commenting out the following three lines in source/cdk/cdk_slurm_stack.py I could turn off the reduction code (at line 2770 in the code I have):

                    if len(parallel_cluster_queue['ComputeResources']):
                        logger.info(f"    Skipping {compute_resource_name:18} compute resource: {instance_types} to reduce number of CRs.")
                        continue

The next line checks whether I've exceeded MAX_NUMBER_OF_COMPUTE_RESOURCES, so there is still a safety check in case my configuration were too large.
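With the skip removed, the loop could look roughly like the sketch below. This is a hedged sketch only: the helper name, the queue/CR dict shapes, and the cap value of 50 are assumptions; only MAX_NUMBER_OF_COMPUTE_RESOURCES comes from the issue, and the real code in cdk_slurm_stack.py differs.

```python
# Assumed ParallelCluster v3 cluster-wide cap; the real constant lives in
# the CDK stack code.
MAX_NUMBER_OF_COMPUTE_RESOURCES = 50

def add_compute_resources(queue, candidate_crs, total_cr_count):
    """Add every candidate CR to a queue instead of skipping all but the
    first, erroring out only when the cluster-wide cap is exceeded.
    Returns the updated cluster-wide CR count."""
    for cr_name, instance_types in candidate_crs:
        if total_cr_count >= MAX_NUMBER_OF_COMPUTE_RESOURCES:
            raise ValueError(
                f"Too many compute resources; remove or exclude instance "
                f"types to stay under {MAX_NUMBER_OF_COMPUTE_RESOURCES}.")
        queue["ComputeResources"].append(
            {"Name": cr_name, "InstanceTypes": instance_types})
        total_cr_count += 1
    return total_cr_count
```

With this shape, od-16-gb would keep both the 2-core and 4-core CRs instead of skipping the second.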

I want to be able to have machines with the same cores and less memory - no need to pay for more than I need.

I was trying to configure as many instance types as ParallelCluster's limits allow, but in retrospect this should really be left up to the user to configure.

I've changed the code to create just one instance type per CR and one CR per queue/partition.
This should allow you to pick and choose which instances you want.
It is now an error if you configure too many instance types; you must either remove included instances or exclude instances until you get under the ParallelCluster limit of 50.
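The new behavior described above can be sketched as follows. This is an illustrative sketch, not the repo's actual implementation: the function name, the queue/CR naming scheme, and the error wording are assumptions; the one-type-per-CR, one-CR-per-queue mapping and the limit of 50 come from the comment.

```python
MAX_COMPUTE_RESOURCES_PER_CLUSTER = 50  # ParallelCluster v3 limit

def build_queues(instance_types):
    """Build one queue per instance type, each with a single compute
    resource containing exactly that instance type. Raises if the
    configuration exceeds the ParallelCluster limit."""
    if len(instance_types) > MAX_COMPUTE_RESOURCES_PER_CLUSTER:
        raise ValueError(
            f"{len(instance_types)} instance types configured; remove "
            f"included instances or exclude instances until you are under "
            f"{MAX_COMPUTE_RESOURCES_PER_CLUSTER}.")
    queues = []
    for instance_type in instance_types:
        cr_name = instance_type.replace(".", "-")
        queues.append({
            "Name": f"q-{cr_name}",
            "ComputeResources": [
                {"Name": cr_name, "InstanceTypes": [instance_type]}],
        })
    return queues
```

Under this scheme, the nine instance types in the original config would yield nine queues with one CR each, so no instance type is ever culled.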