Yelp/mrjob

lock clusters with EMR tags, not S3

coyotemarin opened this issue · 8 comments

Currently, the code to "lock" pooled clusters so that two jobs won't get submitted to the same cluster simultaneously uses S3, and makes the implicit assumption that everyone using the same pool will use the same cloud_tmp_dir.

Instead, we should "lock" clusters with EMR tags, with a format something like __mrjob_pool_lock_<job key>=<timestamp>.

Seems like __mrjob_pool_lock_<job_key> would be a valid tag key.

It might be just as efficient in terms of API calls to have a single tag, __mrjob_pool_lock, which we set to our unique job key + a timestamp. To do this we need to:

  • DescribeCluster to make sure it's not already locked
  • AddTags to acquire the lock
  • DescribeCluster again to make sure we were the ones who acquired the lock

This is as opposed to optimistic locking, where we'd just add our own __mrjob_pool_lock_<job_key> tag and then DescribeCluster to see who won.

However, we don't have to DescribeCluster when we release our lock; we just RemoveTags. We also don't have to worry about a cluster accumulating expired tags, since there's only one tag used for locking.
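
A rough sketch of what the single-tag version could look like, using the raw boto3 EMR client's describe_cluster/add_tags/remove_tags calls (the tag name, value format, helper names, and pause length are illustrative, not a finished implementation):

```python
import time

# single, shared lock tag; name and value format are illustrative
_POOL_LOCK_TAG = '__mrjob_pool_lock'


def _get_lock_value(emr_client, cluster_id):
    """Return the current value of the pool lock tag, or None if unset.

    *emr_client* is a boto3 EMR client created elsewhere.
    """
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)['Cluster']
    for tag in cluster.get('Tags', []):
        if tag['Key'] == _POOL_LOCK_TAG:
            return tag['Value']
    return None


def _attempt_to_lock(emr_client, cluster_id, job_key):
    """Try to acquire the pool lock on *cluster_id*; return True if we won."""
    # 1. DescribeCluster to make sure it's not already locked
    if _get_lock_value(emr_client, cluster_id) is not None:
        return False

    # 2. AddTags to acquire the lock (last writer wins if two jobs race)
    our_value = '%s %f' % (job_key, time.time())
    emr_client.add_tags(
        ResourceId=cluster_id,
        Tags=[dict(Key=_POOL_LOCK_TAG, Value=our_value)])

    # pause so a competing (possibly delayed) AddTags call has time to land
    time.sleep(10.0)

    # 3. DescribeCluster again to make sure we were the ones who got the lock
    return _get_lock_value(emr_client, cluster_id) == our_value


def _release_lock(emr_client, cluster_id):
    """No DescribeCluster needed on release; just RemoveTags."""
    emr_client.remove_tags(ResourceId=cluster_id, TagKeys=[_POOL_LOCK_TAG])
```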

I was thinking this might introduce a clock skew issue, but it turns out the S3 code depends on the local clock as well.

It's also not clear to me that we ever need to release the lock, since it only lasts a minute, and most real jobs will take at least that long. This mostly matters for testing pooling, but we could handle that by patching the lock expiration time to be negative in the tests.
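
For example, expiration could be a module-level constant that the tests patch (the one-minute value, value format, and names below are assumptions carried over from the sketch above):

```python
import time

_POOL_LOCK_EXPIRATION = 60.0  # seconds; tests could patch this to a negative value


def _lock_is_expired(lock_value):
    """Check whether a '<job_key> <timestamp>' lock value has lapsed.

    Like the S3-based code, this trusts the local clock.
    """
    try:
        timestamp = float(lock_value.split(' ')[-1])
    except ValueError:
        # an unparseable lock value shouldn't wedge the cluster forever
        return True

    return time.time() >= timestamp + _POOL_LOCK_EXPIRATION
```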

API calls can be delayed. Thinking about timing, we want to check and tag the cluster as quickly as possible, and then pause a bit and check the cluster again to make sure that another job didn't tag the cluster after us. We probably want something like:

  • 5 seconds to describe and tag the cluster
  • 10 seconds pause
  • no more than 5 seconds to check the cluster again to ensure that our tag wasn't overwritten
  • effectively, 40 more seconds to submit our steps and have the cluster acknowledge them

This isn't totally foolproof; for example, submitting steps could get throttled and our lock could expire somewhere in the middle of it, in which case we'd probably be better off risking sharing the cluster with another pooled job than cancelling our steps. But the first three steps can be time-constrained.
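
One way the first three steps might be time-constrained, reusing the helpers from the earlier sketch (the budgets match the breakdown above; the rest is illustrative):

```python
import time


def _attempt_to_lock_with_deadline(emr_client, cluster_id, job_key):
    """Give up on a cluster if describing/tagging it takes too long."""
    start = time.time()

    # no more than ~5 seconds to describe and tag the cluster
    if _get_lock_value(emr_client, cluster_id) is not None:
        return False
    our_value = '%s %f' % (job_key, time.time())
    emr_client.add_tags(
        ResourceId=cluster_id,
        Tags=[dict(Key=_POOL_LOCK_TAG, Value=our_value)])
    if time.time() - start > 5.0:
        return False  # too slow; a competing job may already think it won

    # 10-second pause, then no more than ~5 seconds to check the cluster again
    time.sleep(10.0)
    won = (_get_lock_value(emr_client, cluster_id) == our_value)
    return won and (time.time() - start) <= 20.0
```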

We might be better off using the "raw" boto3 client rather than our wrapped connection for the locking step. boto3 itself includes retries, but you can turn them off; see the boto3 config documentation for more details (you pass this as the config keyword arg when creating a client).
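
For example, a client with boto3's built-in retries turned off might look like this (assuming a botocore recent enough to support retry modes):

```python
import boto3
from botocore.config import Config

# allow only the initial attempt: a throttled or slow call should fail fast
# rather than silently retrying while the lock window ticks away
_NO_RETRY_CONFIG = Config(retries=dict(max_attempts=1, mode='standard'))

raw_emr_client = boto3.client('emr', config=_NO_RETRY_CONFIG)
```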

Fixed by #2167.