Yelp/mrjob

don't list steps when pooling

Closed this issue · 1 comments

The pooling code uses API calls rather inefficiently. To find a cluster to join we:

  • use ListClusters to find all clusters in the WAITING state (in _usable_clusters())
  • for each of these clusters, call DescribeCluster (also in _usable_clusters(), see _check_cluster_state() for the comparison)
  • list all the cluster's steps (in _check_cluster_state())
  • for each cluster that passes this step, get all its instance groups/fleets (in _compare_cluster_setup())

There are several reasons we check a cluster's steps, and it turns out they're all outdated. First, we don't want to exceed the limit of 256 steps per cluster, but that only applies to clusters running AMIs prior to 3.1.1 (and 2.x clusters prior to 2.4.8). We could get around this by simply preventing the user from using pooling with such outdated AMIs.

Second, there is some defensive code (in _check_cluster_state()) that worried about clusters having steps but failing to run them. This was always a somewhat mysterious issue, but it's several years old, and we can assume it's been fixed.

Finally, we want to know which step number the cluster is on because we included that in the S3 URI we use to "lock" the cluster. The reason for this is obscure, and goes back to the original locking code. Maybe the idea was to make locks self-expire once a step was added? In any case, there are many other ways to implement lock expiry, and they don't require knowledge of the step number.

Yep, the goal was to make locks "expire" when steps are added. So before we eliminate step num from locking, we need to have a mechanism for releasing cluster locks; see #2162.