Creating a cluster fails when using `SchedulingTarget.Master` and a cluster ID previously used by a now-deleted, older cluster
jamesclarke opened this issue · 4 comments
Background
In certain cases, such as automated daily-run data processing jobs, we want to:
- use the AZTK SDK to create a cluster with the `SchedulingTarget.Master` option, so that our driver always runs on the master;
- use a pre-determined, known cluster ID rather than generating one at runtime, so that if the data processing job errors, we know which cluster ID to check to see whether the cluster already exists when we re-run the job (a simplified sketch of this creation code follows the list).
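For reference, this is roughly how we create such a cluster with the SDK. The secrets, sizing, and toolkit values below are placeholders, and model/field names may differ slightly between AZTK versions:

```python
import aztk.spark
from aztk.spark import models

# Placeholder: populate with your Batch and Storage credentials, as for any AZTK deployment.
secrets = models.SecretsConfiguration()

client = aztk.spark.Client(secrets)

cluster_config = models.ClusterConfiguration(
    cluster_id="test-aztk-cluster",                    # pre-determined, known cluster ID
    size=2,                                            # placeholder node count
    vm_size="standard_d2_v2",                          # placeholder VM size
    toolkit=models.SparkToolkit(version="2.3.0"),      # placeholder Spark version
    scheduling_target=models.SchedulingTarget.Master,  # driver always runs on the master
)

cluster = client.cluster.create(cluster_configuration=cluster_config, wait=True)
```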
AZTK version
v0.10.1 release, installed from PyPI using pip
Issue
It seems that:
- when a cluster is created with the option `scheduling_target=SchedulingTarget.Master`, a 'task table' is created in the storage account's table service, using a hashed version of the cluster ID as its ID;
- when the cluster is deleted, the task table is not deleted;
  - I traced the code through to a call to `aztk.client.cluster.helpers.delete.delete_pool_and_job_and_table()`, and for some reason this seems not to delete the task table, yet it does not fail or raise an error;
- when a later cluster is created with the same cluster ID, the cluster-creation code attempts to create a new task table with the hashed cluster ID, but fails when it finds a table with that ID already exists, raising the following error (the conflict can also be reproduced directly against the table service, as sketched after this list):

      AztkError: Conflict {"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
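For what it's worth, the conflict itself is easy to reproduce directly against the table service, independent of AZTK. This is just an illustration, with placeholder credentials and an arbitrary table name:

```python
from azure.common import AzureConflictHttpError
from azure.cosmosdb.table.tableservice import TableService

# Placeholder credentials -- the storage account from the AZTK secrets configuration.
table_service = TableService(account_name="<storage-account>", account_key="<account-key>")

table_service.create_table("aztkconflicttest", fail_on_exist=True)      # first creation succeeds
try:
    table_service.create_table("aztkconflicttest", fail_on_exist=True)  # second attempt conflicts
except AzureConflictHttpError as error:
    print(error)  # the same "TableAlreadyExists" conflict that AZTK re-raises as AztkError
```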
Steps to reproduce (using the AZTK SDK)
- Create a cluster with:
  - a set name (e.g. `test-aztk-cluster`)
  - the option `scheduling_target=SchedulingTarget.Master`
- <Do whatever>
- Delete the cluster.
- Wait a while, until the underlying Azure Batch pool and job have been deleted.
- Check the Azure Storage table used to track tasks and see that it has not been deleted (a scripted way to check this is sketched after the list).
- Create another cluster with the same name (`test-aztk-cluster`) and `scheduling_target=SchedulingTarget.Master`.
- See cluster creation fail with an error like:

      AztkError: Conflict {"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}

- Delete the cluster (i.e. the Azure Batch pool and job).
- Go to the Azure dashboard and manually delete the Azure Storage task table.
- Once everything has finished being deleted, create the same cluster again.
- See that this time cluster creation works without error.
- Delete the cluster and wait for the pool and job to be deleted, but this time do not manually delete the task table.
- Now create a cluster with the same name (`test-aztk-cluster`) but, this time, use the option `scheduling_target=SchedulingTarget.Any`.
- See that this time, although the task table is still there, the cluster is created without error (as a task table is not used with the option `SchedulingTarget.Any`).
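A quick way to verify the 'check the Azure Storage table' step from code rather than the portal is to list the tables in the storage account (placeholder credentials below); after the pool and job are gone, the hashed task table for the cluster is still listed:

```python
from azure.cosmosdb.table.tableservice import TableService

# Placeholder credentials -- use the storage account from your AZTK secrets configuration.
table_service = TableService(account_name="<storage-account>", account_key="<account-key>")

# After the cluster (pool and job) has been deleted, the hashed task table still shows up here.
for table in table_service.list_tables():
    print(table.name)
```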
Error logs
---------------------------------------------------------------------------
AzureConflictHttpError Traceback (most recent call last)
.../lib/python3.6/site-packages/aztk/utils/try_func.py in wrapper(*args, **kwargs)
7 try:
----> 8 return function(*args, **kwargs)
9 except catch_exceptions as e:
.../lib/python3.6/site-packages/aztk/utils/retry.py in wrapper(*args, **kwargs)
16 try:
---> 17 return function(*args, **kwargs)
18 except exceptions:
.../lib/python3.6/site-packages/aztk/client/base/helpers/task_table.py in create_task_table(table_service, id)
66 """
---> 67 return table_service.create_table(helpers.convert_id_to_table_id(id), fail_on_exist=True)
68
.../lib/python3.6/site-packages/azure/cosmosdb/table/tableservice.py in create_table(self, table_name, fail_on_exist, timeout)
541 else:
--> 542 self._perform_request(request)
543 return True
.../lib/python3.6/site-packages/azure/cosmosdb/table/tableservice.py in _perform_request(self, request, parser, parser_args, operation_context)
1105 _update_storage_table_header(request)
-> 1106 return super(TableService, self)._perform_request(request, parser, parser_args, operation_context)
.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
429 exception_str_in_one_line)
--> 430 raise ex
431 finally:
.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
357 retry_context.exception = ex
--> 358 raise ex
359 except Exception as ex:
.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
343 _http_error_handler(
--> 344 HTTPError(response.status, response.message, response.headers, response.body))
345
.../lib/python3.6/site-packages/azure/storage/common/_error.py in _http_error_handler(http_error)
114
--> 115 raise ex
116
AzureConflictHttpError: Conflict
{"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
During handling of the above exception, another exception occurred:
AztkError Traceback (most recent call last)
<ipython-input-2-d2c3d2edc779> in <module>
...
.../lib/python3.6/site-packages/aztk/spark/client/cluster/operations.py in create(self, cluster_configuration, wait)
30 :obj:`aztk.spark.models.Cluster`: An Cluster object representing the state and configuration of the cluster.
31 """
---> 32 return create.create_cluster(self._core_cluster_operations, self, cluster_configuration, wait)
33
34 def delete(self, id: str, keep_logs: bool = False):
.../lib/python3.6/site-packages/aztk/spark/client/cluster/helpers/create.py in create_cluster(core_cluster_operations, spark_cluster_operations, cluster_conf, wait)
64
65 cluster = core_cluster_operations.create(cluster_conf, software_metadata_key, start_task,
---> 66 constants.SPARK_VM_IMAGE)
67
68 # Wait for the master to be ready
.../lib/python3.6/site-packages/aztk/client/cluster/operations.py in create(self, cluster_configuration, software_metadata_key, start_task, vm_image_model)
21 """
22 return create.create_pool_and_job_and_table(self, cluster_configuration, software_metadata_key, start_task,
---> 23 vm_image_model)
24
25 def get(self, id: str):
.../lib/python3.6/site-packages/aztk/client/cluster/helpers/create.py in create_pool_and_job_and_table(core_cluster_operations, cluster_conf, software_metadata_key, start_task, VmImageModel)
71 # create storage task table
72 if cluster_conf.scheduling_target != models.SchedulingTarget.Any:
---> 73 core_cluster_operations.create_task_table(cluster_conf.cluster_id)
74
75 return helpers.get_cluster(cluster_conf.cluster_id, core_cluster_operations.batch_client)
.../lib/python3.6/site-packages/aztk/client/base/base_operations.py in create_task_table(self, id)
233 id (:obj:`str`): the id of the cluster
234 """
--> 235 return task_table.create_task_table(self.table_service, id)
236
237 def list_task_table_entries(self, id):
.../lib/python3.6/site-packages/aztk/utils/try_func.py in wrapper(*args, **kwargs)
11 raise raise_exception(exception_formatter(e))
12 else:
---> 13 raise raise_exception(str(e))
14
15 return wrapper
AztkError: Conflict
{"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
Thanks for reporting this with so much detail. I am investigating and will update when a fix is available.
I was able to reproduce the issue. I will release a patch for this as soon as possible.
@jamesclarke The root cause of the issue was that storage table detection for a cluster passed the wrong table ID. As a result, storage tables were not deleted when clusters were deleted.
A new version (0.10.2) was released last Friday with a fix so that storage tables will be deleted properly. Unfortunately, for clusters deployed with previous versions, you will have to either re-run the `aztk cluster delete` command or otherwise clean up the stranded storage tables. Let me know if you have any questions.
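If you prefer to clean up a stranded table from code rather than the portal, something along these lines should work. It derives the hashed table name with the same `convert_id_to_table_id` helper that appears in the traceback above (adjust the import path and credentials to your setup):

```python
from azure.cosmosdb.table.tableservice import TableService

# convert_id_to_table_id is the helper used by AZTK's task-table code (see the
# traceback above); adjust the import path if it differs in your installed version.
from aztk.utils import helpers

# Placeholder credentials -- the storage account used by AZTK.
table_service = TableService(account_name="<storage-account>", account_key="<account-key>")

# Derive the hashed table name from the cluster ID and delete the stranded table.
table_id = helpers.convert_id_to_table_id("test-aztk-cluster")
table_service.delete_table(table_id, fail_not_exist=False)
```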
Thanks, @jafreck for the quick turnaround on this! That's great. I've upgraded to 0.10.2 and it all seems good so far.