Yelp/mrjob

lock clusters with EMR tags, not S3

coyotemarin opened this issue · 8 comments

Currently, the code to "lock" pooled clusters so that two jobs won't get submitted to the same cluster simultaneously uses S3, and makes the implicit assumption that everyone using the same pool will use the same cloud_tmp_dir.

Instead, we should "lock" clusters with EMR tags, with a format something like __mrjob_pool_lock_<job key>=<timestamp>.

Seems like __mrjob_pool_lock_<job_key> would be a valid tag key.

It might be just as efficient in terms of API calls to have a single tag, __mrjob_pool_lock, which we set to our unique job key + a timestamp. To do this we need to:

  • DescribeCluster to make sure it's not already locked
  • AddTags to acquire the lock
  • DescribeCluster again to make sure we were the ones who acquired the lock

This is as opposed to optimistic locking, where we'd just add our own __mrjob_pool_lock_<job_key> tag and then DescribeCluster to see who won.

However, we don't have to DescribeCluster when we release our lock; we just RemoveTags. We also don't have to worry about a cluster accumulating expired tags, since there's only one tag used for locking.
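
A rough sketch of what the single-tag version could look like, using the raw boto3 EMR client's describe_cluster/add_tags/remove_tags calls (the tag name, value format, helper names, and pause length are illustrative, not a finished implementation):

```python
import time

# single, shared lock tag; name and value format are illustrative
_POOL_LOCK_TAG = '__mrjob_pool_lock'


def _get_lock_value(emr_client, cluster_id):
    """Return the current value of the pool lock tag, or None if unset.

    *emr_client* is a boto3 EMR client created elsewhere.
    """
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)['Cluster']
    for tag in cluster.get('Tags', []):
        if tag['Key'] == _POOL_LOCK_TAG:
            return tag['Value']
    return None


def _attempt_to_lock(emr_client, cluster_id, job_key):
    """Try to acquire the pool lock on *cluster_id*; return True if we won."""
    # 1. DescribeCluster to make sure it's not already locked
    if _get_lock_value(emr_client, cluster_id) is not None:
        return False

    # 2. AddTags to acquire the lock (last writer wins if two jobs race)
    our_value = '%s %f' % (job_key, time.time())
    emr_client.add_tags(
        ResourceId=cluster_id,
        Tags=[dict(Key=_POOL_LOCK_TAG, Value=our_value)])

    # pause so a competing (possibly delayed) AddTags call has time to land
    time.sleep(10.0)

    # 3. DescribeCluster again to make sure we were the ones who got the lock
    return _get_lock_value(emr_client, cluster_id) == our_value


def _release_lock(emr_client, cluster_id):
    """No DescribeCluster needed on release; just RemoveTags."""
    emr_client.remove_tags(ResourceId=cluster_id, TagKeys=[_POOL_LOCK_TAG])
```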

I was thinking this might introduce a clock skew issue, but it turns out the S3 code depends on the local clock as well.

It's also not clear to me that we ever need to release the lock, since it only lasts a minute, and most real jobs will take at least that long. This mostly matters for testing pooling, but we could handle that by patching the lock expiration time to be negative in the tests.
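
For example, expiration could be a module-level constant that the tests patch (the one-minute value, value format, and names below are assumptions carried over from the sketch above):

```python
import time

_POOL_LOCK_EXPIRATION = 60.0  # seconds; tests could patch this to a negative value


def _lock_is_expired(lock_value):
    """Check whether a '<job_key> <timestamp>' lock value has lapsed.

    Like the S3-based code, this trusts the local clock.
    """
    try:
        timestamp = float(lock_value.split(' ')[-1])
    except ValueError:
        # an unparseable lock value shouldn't wedge the cluster forever
        return True

    return time.time() >= timestamp + _POOL_LOCK_EXPIRATION
```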

API calls can be delayed. Thinking about timing, we want to check and tag the cluster as quickly as possible, and then pause a bit and check the cluster again to make sure that another job didn't tag the cluster after us. We probably want something like:

  • 5 seconds to describe and tag the cluster
  • 10 seconds pause
  • no more than 5 seconds to check the cluster again to ensure that our tag wasn't overwritten
  • effectively, 40 more seconds to submit our steps and have the cluster acknowledge them

This isn't totally foolproof; for example, submitting steps could get throttled and our lock could expire somewhere in the middle of it, in which case we'd probably be better off risking sharing the cluster with another pooled job than cancelling our steps. But the first three steps can be time-constrained.
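
One way the first three steps might be time-constrained, reusing the helpers from the earlier sketch (the budgets match the breakdown above; the rest is illustrative):

```python
import time


def _attempt_to_lock_with_deadline(emr_client, cluster_id, job_key):
    """Give up on a cluster if describing/tagging it takes too long."""
    start = time.time()

    # no more than ~5 seconds to describe and tag the cluster
    if _get_lock_value(emr_client, cluster_id) is not None:
        return False
    our_value = '%s %f' % (job_key, time.time())
    emr_client.add_tags(
        ResourceId=cluster_id,
        Tags=[dict(Key=_POOL_LOCK_TAG, Value=our_value)])
    if time.time() - start > 5.0:
        return False  # too slow; a competing job may already think it won

    # 10-second pause, then no more than ~5 seconds to check the cluster again
    time.sleep(10.0)
    won = (_get_lock_value(emr_client, cluster_id) == our_value)
    return won and (time.time() - start) <= 20.0
```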

We might be better off using the "raw" boto3 client rather than our wrapped connection for the locking step. boto3 itself includes retries, but you can turn them off; see the boto3 config documentation for more details (you pass this as the config keyword arg when creating a client).
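
For example, a client with boto3's built-in retries turned off might look like this (assuming a botocore recent enough to support retry modes):

```python
import boto3
from botocore.config import Config

# allow only the initial attempt: a throttled or slow call should fail fast
# rather than silently retrying while the lock window ticks away
_NO_RETRY_CONFIG = Config(retries=dict(max_attempts=1, mode='standard'))

raw_emr_client = boto3.client('emr', config=_NO_RETRY_CONFIG)
```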

Fixed by #2167.