Azure/aztk

Creating a cluster fails when using `SchedulingTarget.Master` and a cluster ID previously used by a now-deleted, older cluster

jamesclarke opened this issue · 4 comments

Background

In certain cases, such as automated daily-run data processing jobs, we want to:

  • use the AZTK SDK to create a cluster with the SchedulingTarget.Master option so that our driver always runs on the master;
  • use a pre-determined, known cluster ID rather than generating one at runtime, so that if the data processing job fails we know which cluster ID to look up on re-run to check whether the cluster already exists.

AZTK version

v0.10.1 release, installed from PyPI using pip

Issue

It seems that:

  • when a cluster is created with the option scheduling_target=SchedulingTarget.Master, a 'task table' is created in the storage account's table service, using a hashed version of the cluster ID as its ID;
  • when the cluster is deleted, the task table is not deleted;
    • I traced the code through to a call to aztk.client.cluster.helpers.delete.delete_pool_and_job_and_table(); for some reason this does not appear to delete the task table, yet it does not fail or raise an error either.
  • when a later cluster is created with the same cluster ID, the cluster-creation code attempts to create a new task table with the hashed cluster ID and fails because a table with that ID already exists, raising the following error:
    • AztkError: Conflict
      {"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
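
The sequence described in this section can be modelled without any Azure resources. The sketch below is a toy, in-memory stand-in (the class and function names are hypothetical and are not AZTK or Azure code). It only illustrates how a delete call that quietly misses the table, followed by a create call with `fail_on_exist=True` (the flag visible in the traceback), would produce a conflict on the second creation. It is an illustration of the symptom, not a diagnosis of the library.

```python
class TableAlreadyExists(Exception):
    """Stand-in for the storage service's 409 Conflict response."""


class InMemoryTableService:
    """Toy replacement for a table service; hypothetical, not AZTK or Azure code."""

    def __init__(self):
        self.tables = set()

    def create_table(self, name, fail_on_exist=False):
        if name in self.tables:
            if fail_on_exist:
                raise TableAlreadyExists(name)
            return False
        self.tables.add(name)
        return True

    def delete_table(self, name, fail_not_exist=False):
        if name not in self.tables:
            if fail_not_exist:
                raise KeyError(name)
            return False  # silent no-op: nothing is raised, nothing is deleted
        self.tables.remove(name)
        return True


def create_delete_create(service, create_id, delete_id):
    """Create a table, delete by delete_id, then create the same table again.

    Returns whether the delete step actually removed a table.
    """
    service.create_table(create_id, fail_on_exist=True)  # first cluster
    deleted = service.delete_table(delete_id)            # cluster deletion
    service.create_table(create_id, fail_on_exist=True)  # second cluster
    return deleted
```

Calling `create_delete_create(InMemoryTableService(), "abc", "abc")` completes normally, whereas passing a `delete_id` that does not match `create_id` leaves the table behind and the second create raises `TableAlreadyExists`.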
      

Steps to reproduce (using the AZTK SDK)

  1. Create a cluster with:
    • a set name (e.g. test-aztk-cluster)
    • the option scheduling_target=SchedulingTarget.Master
  2. <Do whatever>
  3. Delete the cluster.
  4. Wait a while, until the underlying Azure Batch pool and job have been deleted.
  5. Check the Azure Storage table used to track tasks and see that it has not been deleted.
  6. Create another cluster with the same name (test-aztk-cluster) and scheduling_target=SchedulingTarget.Master
  7. See cluster creation fail with an error like:
    • AztkError: Conflict
      {"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
      
  8. Delete the cluster (i.e. Azure Batch pool and job).
  9. Go to the Azure dashboard and manually delete the Azure Storage task table.
  10. Once everything is finished being deleted, create the same cluster again.
  11. See that this time cluster creation works without error.
  12. Delete the cluster and wait for the pool and job to be deleted but this time do not manually delete the task table.
  13. Now create a cluster with the same name (test-aztk-cluster) but, this time, use the option scheduling_target=SchedulingTarget.Any.
  14. See that this time, although the task table is still there, the cluster is created without error (as a task table is not used with the option SchedulingTarget.Any)
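
Steps 1 to 7 above could be scripted roughly as follows. This is a sketch under assumptions: `client` is taken to be an AZTK Spark client whose `cluster.create(cluster_configuration, wait)` and `cluster.delete(id)` methods match the signatures visible in the traceback below, while `make_config` and `wait_until_deleted` are hypothetical caller-supplied helpers (the first would build a cluster configuration with scheduling_target=SchedulingTarget.Master, the second would block until the Batch pool and job are gone).

```python
def reproduce(client, make_config, wait_until_deleted, cluster_id="test-aztk-cluster"):
    """Return True if re-creating the cluster hits the TableAlreadyExists conflict."""
    # Step 1: create the cluster under a fixed, known ID.
    client.cluster.create(make_config(cluster_id), wait=True)

    # Step 3: delete the cluster.
    client.cluster.delete(cluster_id)

    # Step 4: wait for the underlying pool and job to disappear
    # (hypothetical helper supplied by the caller).
    wait_until_deleted(cluster_id)

    # Steps 6 and 7: re-create with the same ID and inspect the failure, if any.
    try:
        client.cluster.create(make_config(cluster_id), wait=True)
    except Exception as error:
        if "TableAlreadyExists" in str(error):
            return True
        raise  # some other failure; do not mask it
    return False
```

Matching on the error text rather than on an exception class keeps the sketch independent of where `AztkError` is imported from, which is not shown in this report.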

Error logs

---------------------------------------------------------------------------
AzureConflictHttpError                    Traceback (most recent call last)
.../lib/python3.6/site-packages/aztk/utils/try_func.py in wrapper(*args, **kwargs)
      7             try:
----> 8                 return function(*args, **kwargs)
      9             except catch_exceptions as e:

.../lib/python3.6/site-packages/aztk/utils/retry.py in wrapper(*args, **kwargs)
     16                 try:
---> 17                     return function(*args, **kwargs)
     18                 except exceptions:

.../lib/python3.6/site-packages/aztk/client/base/helpers/task_table.py in create_task_table(table_service, id)
     66     """
---> 67     return table_service.create_table(helpers.convert_id_to_table_id(id), fail_on_exist=True)
     68 

.../lib/python3.6/site-packages/azure/cosmosdb/table/tableservice.py in create_table(self, table_name, fail_on_exist, timeout)
    541         else:
--> 542             self._perform_request(request)
    543             return True

.../lib/python3.6/site-packages/azure/cosmosdb/table/tableservice.py in _perform_request(self, request, parser, parser_args, operation_context)
   1105         _update_storage_table_header(request)
-> 1106         return super(TableService, self)._perform_request(request, parser, parser_args, operation_context)

.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    429                                  exception_str_in_one_line)
--> 430                     raise ex
    431             finally:

.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    357                     retry_context.exception = ex
--> 358                     raise ex
    359                 except Exception as ex:

.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    343                         _http_error_handler(
--> 344                             HTTPError(response.status, response.message, response.headers, response.body))
    345 

.../lib/python3.6/site-packages/azure/storage/common/_error.py in _http_error_handler(http_error)
    114 
--> 115     raise ex
    116 

AzureConflictHttpError: Conflict
{"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}

During handling of the above exception, another exception occurred:

AztkError                                 Traceback (most recent call last)
<ipython-input-2-d2c3d2edc779> in <module>

...

.../lib/python3.6/site-packages/aztk/spark/client/cluster/operations.py in create(self, cluster_configuration, wait)
     30             :obj:`aztk.spark.models.Cluster`: An Cluster object representing the state and configuration of the cluster.
     31         """
---> 32         return create.create_cluster(self._core_cluster_operations, self, cluster_configuration, wait)
     33 
     34     def delete(self, id: str, keep_logs: bool = False):

.../lib/python3.6/site-packages/aztk/spark/client/cluster/helpers/create.py in create_cluster(core_cluster_operations, spark_cluster_operations, cluster_conf, wait)
     64 
     65         cluster = core_cluster_operations.create(cluster_conf, software_metadata_key, start_task,
---> 66                                                  constants.SPARK_VM_IMAGE)
     67 
     68         # Wait for the master to be ready

.../lib/python3.6/site-packages/aztk/client/cluster/operations.py in create(self, cluster_configuration, software_metadata_key, start_task, vm_image_model)
     21         """
     22         return create.create_pool_and_job_and_table(self, cluster_configuration, software_metadata_key, start_task,
---> 23                                                     vm_image_model)
     24 
     25     def get(self, id: str):

.../lib/python3.6/site-packages/aztk/client/cluster/helpers/create.py in create_pool_and_job_and_table(core_cluster_operations, cluster_conf, software_metadata_key, start_task, VmImageModel)
     71     # create storage task table
     72     if cluster_conf.scheduling_target != models.SchedulingTarget.Any:
---> 73         core_cluster_operations.create_task_table(cluster_conf.cluster_id)
     74 
     75     return helpers.get_cluster(cluster_conf.cluster_id, core_cluster_operations.batch_client)

.../lib/python3.6/site-packages/aztk/client/base/base_operations.py in create_task_table(self, id)
    233             id (:obj:`str`): the id of the cluster
    234         """
--> 235         return task_table.create_task_table(self.table_service, id)
    236 
    237     def list_task_table_entries(self, id):

.../lib/python3.6/site-packages/aztk/utils/try_func.py in wrapper(*args, **kwargs)
     11                     raise raise_exception(exception_formatter(e))
     12                 else:
---> 13                     raise raise_exception(str(e))
     14 
     15         return wrapper

AztkError: Conflict
{"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
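
The two chained tracebacks above come from an exception-translating wrapper: the storage error is caught and a new error is raised from inside the `except` block, which is why Python prints "During handling of the above exception, another exception occurred". The sketch below reconstructs that pattern from the few wrapper lines visible in the traceback. The decorator name, its outer signature, the `if` condition, and the `WrappedError` class are assumptions made for illustration, not the library's actual source.

```python
import functools


class WrappedError(Exception):
    """Stand-in for a library-level error type such as AztkError."""


def translate_errors(catch_exceptions, raise_exception=WrappedError, exception_formatter=None):
    """Re-raise selected exceptions as a single library-level error type."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except catch_exceptions as e:
                # Raising inside the except block chains the original
                # exception implicitly via __context__.
                if exception_formatter:
                    raise raise_exception(exception_formatter(e))
                else:
                    raise raise_exception(str(e))

        return wrapper

    return decorator
```

Because the new error carries only `str(e)`, the caller sees the same "Conflict" message and JSON body under a different exception type, as in the final lines of the log.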

Thanks for reporting this with so much detail. I am investigating and will update when a fix is available.

I was able to reproduce the issue. I will release a patch for this as soon as possible.

@jamesclarke The root cause of the issue was that storage table detection for a cluster passed the wrong table id. As a result, storage tables were not deleted when clusters were deleted.

A new version (0.10.2) was released last Friday with a fix so that storage tables will be deleted properly. Unfortunately, for clusters deployed with previous versions, you will have to either re-run the aztk cluster delete command, or otherwise clean up the stranded storage tables. Let me know if you have any questions.
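
For anyone who prefers to clean up stranded tables from a script, a minimal sketch follows. It assumes `table_service` is the same kind of `TableService` object that appears in the traceback, and uses its `delete_table(table_name, fail_not_exist=False)` call. `to_table_id` is a placeholder for whatever maps a cluster ID to its table name (the traceback shows `helpers.convert_id_to_table_id`, but its import path is not shown here, so it is passed in rather than imported).

```python
def delete_stranded_task_tables(table_service, cluster_ids, to_table_id):
    """Delete the task table for each cluster ID; return {cluster_id: was_deleted}."""
    results = {}
    for cluster_id in cluster_ids:
        table_name = to_table_id(cluster_id)
        # fail_not_exist=False turns a missing table into a False return
        # value instead of an exception, so reruns are harmless.
        results[cluster_id] = table_service.delete_table(table_name, fail_not_exist=False)
    return results
```

Only run something like this for cluster IDs whose pool and job are already gone, since a live cluster scheduled with SchedulingTarget.Master still relies on its task table.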

Thanks, @jafreck, for the quick turnaround on this! That's great. I've upgraded to 0.10.2 and it all seems good so far.