Yelp/mrjob

put most pooling info in cluster name

coyotemarin opened this issue · 14 comments

We can make cluster pooling use fewer API calls by not listing clusters' steps (see #2159). However, if we want to be really efficient, we can include all the relevant information in the cluster's name, which is available from the ListClusters API call and can hold up to 256 characters. Currently, clusters have a "pool hash" that encapsulates bootstrapping information, but we could put everything we need to match exactly about the cluster (e.g. AMI version) into a single hash. Space permitting, we could even include information about total CPU, minimum instance memory, etc. in the cluster's name.
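Something like this sketch, say (hypothetical function and field names; the real pool hash only covers bootstrap setup, and the exact set of fields to fold in is the open question here):

```python
import hashlib
import json


def pool_hash(bootstrap_actions, image_version, applications):
    """Fold everything that must match exactly into one hex digest.

    json.dumps with sort_keys gives a stable serialization, so the
    same config always produces the same hash.
    """
    payload = json.dumps(
        dict(bootstrap=bootstrap_actions,
             image=image_version,
             applications=sorted(applications)),
        sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


# embed the hash (plus optional capacity info) in the cluster name,
# which may be up to 256 characters long
name = 'mrjob-pool-%s-cpu%d-mem%d' % (
    pool_hash([], '5.30.0', ['Hadoop', 'Spark']), 16, 64)
assert len(name) <= 256
```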

We need to DescribeCluster for any cluster we intend to join anyway, so that we can look at the __mrjob_pool_lock tag, so we can look at Applications then (jobs may join a cluster that runs a superset of our required applications).

We could also just make Applications an exact match; it's similar to bootstrapping or AMI in that way.

Looking at _compare_cluster_setup(), here are the pool matching rules that aren't an exact match:

  • Applications: okay to join cluster with additional applications
  • RunningAmiVersion: okay to partially specify AMI version (e.g. 2.4 matches 2.4.8)
  • EbsRootVolumeSize: if not using default, okay if bigger
  • instances:
    • for instance groups, memory/cpus/bid price for each of master/core/task
    • for instance fleets, it's complicated
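Roughly, the non-exact rules above look like this (a sketch under assumed semantics, not the actual code from _compare_cluster_setup()):

```python
def applications_ok(requested, running):
    """Applications rule: okay to join a cluster with extra applications."""
    return set(requested) <= set(running)


def ami_version_ok(requested, running):
    """RunningAmiVersion rule: a partial version like '2.4' matches '2.4.8'."""
    req = requested.split('.')
    return running.split('.')[:len(req)] == req


def ebs_root_ok(requested, running):
    """EbsRootVolumeSize rule: None means the default (must match exactly);
    otherwise a bigger root volume is acceptable."""
    if requested is None:
        return running is None
    return running is not None and running >= requested
```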

Additionally, available clusters are sorted, but only by their CPU and memory capacity; jobs try to pick the "best" cluster, even if they wouldn't have procured one that large themselves.

The RunningAmiVersion thing is years out-of-date; there's no reason to support partial AMI versions at all. Nowadays, we just use release labels, which are always exact.

Applications could very well be exact; it's analogous to AMI setup.

Really we're just trying to find a way to avoid calling DescribeCluster on a cluster that we can't join. Once we decide to join a cluster, we have to describe it anyway in order to "lock" it.

ListClusters will show up to 50 clusters in one page, and can filter by cluster state. Probably most users won't have more than 50 clusters in the WAITING state, so there's not much point in trying to optimize ListClusters (for example, not bothering to look at clusters in the second page until we've exhausted all the ones in the first).
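With boto3 this is just a paginator over ListClusters with a state filter; a sketch (the client is passed in, so list_waiting_clusters is easy to stub out in tests):

```python
def list_waiting_clusters(emr):
    """Page through ListClusters, keeping only WAITING clusters.

    *emr* is a boto3 EMR client. Each page holds up to 50 cluster
    summaries (Id, Name, Status, NormalizedInstanceHours), so
    name-based matching needs no further API calls per cluster.
    """
    clusters = []
    for page in emr.get_paginator('list_clusters').paginate(
            ClusterStates=['WAITING']):
        clusters.extend(page['Clusters'])
    return clusters
```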

One thing we can't really avoid is checking what subnet a cluster is in. You can tell EMR to start a job in one of a list of subnets, and you won't know what subnet it's actually in until the cluster launches, at which point you can't change the cluster name. We could exact-match on the list of subnets, but that seems a bit annoying to understand and debug.
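Checking the subnet after the fact is a single field off DescribeCluster (a sketch; cluster_subnet is a hypothetical helper name):

```python
def cluster_subnet(emr, cluster_id):
    """Return the subnet a cluster actually launched in.

    *emr* is a boto3 EMR client. The subnet isn't known until launch,
    so it can't go in the cluster name; DescribeCluster reports it
    under Ec2InstanceAttributes.
    """
    cluster = emr.describe_cluster(ClusterId=cluster_id)['Cluster']
    return cluster['Ec2InstanceAttributes'].get('Ec2SubnetId')
```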

Shoot, there might not be enough space in the name to encode EBS volume information for each instance role. Each instance role may be configured with any number of volumes, and each of those has a volume type, a size, and, if the volume type is io1, IOPS.

Might be best to put enough information in the cluster name to give the user a sense of which clusters will probably work and which one is "best" in terms of memory and CPU. Then, before joining a cluster, we check its instance groups/fleets and (when "locking" it) the cluster info we look at in _check_cluster_setup().

Just putting the pool hash in the cluster name would be a big win, honestly. And then we could add more information to that hash.

NormalizedInstanceHours divided by the whole number of hours of run time might be sufficient for choosing the "best" cluster. And ListClusters returns this information.
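A sketch of that ranking metric, using only fields from a ListClusters summary (normalized_hours_per_hour is a hypothetical name; the real ranking would also need a tie-break):

```python
from datetime import datetime


def normalized_hours_per_hour(cluster_summary, now=None):
    """Rough capacity rate for ranking pooled clusters:
    NormalizedInstanceHours divided by whole hours of run time.

    Both fields come straight off a ListClusters summary, so no extra
    API call is needed. Run time under an hour counts as one hour.
    """
    now = now or datetime.utcnow()
    created = cluster_summary['Status']['Timeline']['CreationDateTime']
    hours = max(1, int((now - created).total_seconds() // 3600))
    return cluster_summary['NormalizedInstanceHours'] / hours
```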

Fixed by #2174