Yelp/mrjob

put most pooling info in cluster name

coyotemarin opened this issue · 14 comments

We can make cluster pooling use fewer API calls by not listing clusters' steps (see #2159). However, if we want to be really efficient, we can include all the relevant information in the cluster's name, which is available from the ListClusters API call and can hold up to 256 characters. Currently, clusters have a "pool hash" that encapsulates bootstrapping information, but we could put everything we need to match exactly about the cluster (e.g. AMI version) into a single hash. Space permitting, we could even include information about total CPU, minimum instance memory, etc. in the cluster's name.
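Something like this sketch, say (hypothetical function and field names; the real pool hash only covers bootstrap setup, and the exact set of fields to fold in is the open question here):

```python
import hashlib
import json


def pool_hash(bootstrap_actions, image_version, applications):
    """Fold everything that must match exactly into one hex digest.

    json.dumps with sort_keys gives a stable serialization, so the
    same config always produces the same hash.
    """
    payload = json.dumps(
        dict(bootstrap=bootstrap_actions,
             image=image_version,
             applications=sorted(applications)),
        sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


# embed the hash (plus optional capacity info) in the cluster name,
# which may be up to 256 characters long
name = 'mrjob-pool-%s-cpu%d-mem%d' % (
    pool_hash([], '5.30.0', ['Hadoop', 'Spark']), 16, 64)
assert len(name) <= 256
```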

We need to DescribeCluster for any cluster we intend to join anyway, so that we can look at the __mrjob_pool_lock tag, so we can look at Applications then (jobs may join a cluster that runs a superset of our required applications).

We could also just make Applications an exact match; it's similar to bootstrapping or AMI in that way.

Looking at _compare_cluster_setup(), here are the pool matching rules that aren't an exact match:

  • Applications: okay to join cluster with additional applications
  • RunningAmiVersion: okay to partially specify AMI version (e.g. 2.4 matches 2.4.8)
  • EbsRootVolumeSize: if not using default, okay if bigger
  • instances:
    • for instance groups, memory/cpus/bid price for each of master/core/task
    • for instance fleets, it's complicated
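Roughly, the non-exact rules above look like this (a sketch under assumed semantics, not the actual code from _compare_cluster_setup()):

```python
def applications_ok(requested, running):
    """Applications rule: okay to join a cluster with extra applications."""
    return set(requested) <= set(running)


def ami_version_ok(requested, running):
    """RunningAmiVersion rule: a partial version like '2.4' matches '2.4.8'."""
    req = requested.split('.')
    return running.split('.')[:len(req)] == req


def ebs_root_ok(requested, running):
    """EbsRootVolumeSize rule: None means the default (must match exactly);
    otherwise a bigger root volume is acceptable."""
    if requested is None:
        return running is None
    return running is not None and running >= requested
```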

Additionally, available clusters are sorted, but only by their CPU and memory capacity; jobs try to pick the "best" cluster, even if they wouldn't have procured one that large themselves.

The RunningAmiVersion thing is years out-of-date; there's no reason to support partial AMI versions at all. Nowadays, we just use release labels, which are always exact.

Applications could very well be exact; it's analogous to AMI setup.

Really we're just trying to find a way to avoid calling DescribeCluster on a cluster that we can't join. Once we decide to join a cluster, we have to describe it anyway in order to "lock" it.

ListClusters will show up to 50 clusters in one page, and can filter by cluster state. Probably most users won't have more than 50 clusters in the WAITING state, so there's not much point in trying to optimize ListClusters (for example, not bothering to look at clusters in the second page until we've exhausted all the ones in the first).
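With boto3 this is just a paginator over ListClusters with a state filter; a sketch (the client is passed in, so list_waiting_clusters is easy to stub out in tests):

```python
def list_waiting_clusters(emr):
    """Page through ListClusters, keeping only WAITING clusters.

    *emr* is a boto3 EMR client. Each page holds up to 50 cluster
    summaries (Id, Name, Status, NormalizedInstanceHours), so
    name-based matching needs no further API calls per cluster.
    """
    clusters = []
    for page in emr.get_paginator('list_clusters').paginate(
            ClusterStates=['WAITING']):
        clusters.extend(page['Clusters'])
    return clusters
```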

One thing we can't really avoid is checking what subnet a cluster is in. You can tell EMR to start a job in one of a list of subnets, and you won't know what subnet it's actually in until the cluster launches, at which point you can't change the cluster name. We could exact-match on the list of subnets, but that seems a bit annoying to understand and debug.
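Checking the subnet after the fact is a single field off DescribeCluster (a sketch; cluster_subnet is a hypothetical helper name):

```python
def cluster_subnet(emr, cluster_id):
    """Return the subnet a cluster actually launched in.

    *emr* is a boto3 EMR client. The subnet isn't known until launch,
    so it can't go in the cluster name; DescribeCluster reports it
    under Ec2InstanceAttributes.
    """
    cluster = emr.describe_cluster(ClusterId=cluster_id)['Cluster']
    return cluster['Ec2InstanceAttributes'].get('Ec2SubnetId')
```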

Shoot, there might not be enough space in the name to encode EBS volume information for each instance role. Each instance role may be configured with any number of volumes, and each of those has a volume type, a size, and, if the volume type is io1, IOPS.

Might be best to put enough information in the cluster name to give the user a sense of which clusters will probably work and which one is "best" in terms of memory and CPU. Then, before joining a cluster, we check its instance groups/fleets and (when "locking" it) the cluster info we look at in _check_cluster_setup().

Just putting the pool hash in the cluster name would be a big win, honestly. And then we could add more information to that hash.

NormalizedInstanceHours divided by the whole number of hours of run time might be sufficient for choosing the "best" cluster. And ListClusters returns this information.
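A sketch of that ranking metric, using only fields from a ListClusters summary (normalized_hours_per_hour is a hypothetical name; the real ranking would also need a tie-break):

```python
from datetime import datetime


def normalized_hours_per_hour(cluster_summary, now=None):
    """Rough capacity rate for ranking pooled clusters:
    NormalizedInstanceHours divided by whole hours of run time.

    Both fields come straight off a ListClusters summary, so no extra
    API call is needed. Run time under an hour counts as one hour.
    """
    now = now or datetime.utcnow()
    created = cluster_summary['Status']['Timeline']['CreationDateTime']
    hours = max(1, int((now - created).total_seconds() // 3600))
    return cluster_summary['NormalizedInstanceHours'] / hours
```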

Fixed by #2174