astronomer/astro-sdk

Load file S3 to databricks failing

sunank200 opened this issue · 1 comment

Describe the bug
Loading a file from S3 to Databricks is failing intermittently.
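For context, the load is invoked roughly like the sketch below (a minimal reproduction; the bucket, file path, table name, and connection IDs are placeholders, not the exact DAG from the failing deployment):

```python
# Minimal sketch of the failing pattern. Bucket, table, and connection IDs
# are placeholders, not the exact values from the failing deployment.
import pendulum
from airflow.decorators import dag

from astro import sql as aql
from astro.files import File
from astro.table import Table


@dag(start_date=pendulum.datetime(2023, 6, 1), schedule=None, catchup=False)
def load_s3_to_databricks():
    # load_file routes to load_file_to_delta (see the traceback below) when
    # the output table's conn_id points at a Databricks connection.
    aql.load_file(
        input_file=File(path="s3://my-bucket/data.csv", conn_id="aws_default"),
        output_table=Table(name="my_delta_table", conn_id="databricks_default"),
    )


load_s3_to_databricks()
```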

It looks like the job was terminated with the following error:
```
[2023-06-13, 07:57:27 UTC] {taskinstance.py:1916} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/astro/.local/lib/python3.10/site-packages/astro/sql/operators/load_file.py", line 131, in execute
    return self.load_data(input_file=self.input_file, context=context)
  File "/home/astro/.local/lib/python3.10/site-packages/astro/sql/operators/load_file.py", line 136, in load_data
    return self.load_data_to_table(input_file, context)
  File "/home/astro/.local/lib/python3.10/site-packages/astro/sql/operators/load_file.py", line 155, in load_data_to_table
    database.load_file_to_table(
  File "/home/astro/.local/lib/python3.10/site-packages/astro/databases/databricks/delta.py", line 151, in load_file_to_table
    load_file_to_delta(
  File "/home/astro/.local/lib/python3.10/site-packages/astro/databases/databricks/load_file/load_file_job.py", line 88, in load_file_to_delta
    create_and_run_job(
  File "/home/astro/.local/lib/python3.10/site-packages/astro/databases/databricks/api_utils.py", line 174, in create_and_run_job
    raise AirflowException(f"Databricks job failed. Job info {final_job_state}")
airflow.exceptions.AirflowException: Databricks job failed. Job info {'job_id': 1073688612166577, 'run_id': 20103644, 'creator_user_name': 'phani.kumar@astronomer.io', 'number_in_job': 20103644, 'state': {'life_cycle_state': 'TERMINATED', 'result_state': 'FAILED', 'state_message': '', 'user_cancelled_or_timedout': False}, 'task': {'spark_python_task': {'python_file': 'dbfs:/mnt/pyscripts/load_file__tmp_sgkz077l6qb3lip4hzukuutt5gvkgx7axpht6gnnqwk04lbg6t35g5j30.py'}}, 'cluster_spec': {'existing_cluster_id': '0403-094356-wab883hn'}, 'cluster_instance': {'cluster_id': '0403-094356-wab883hn', 'spark_context_id': '5214875259947915008'}, 'start_time': 1686642732666, 'setup_duration': 245000, 'execution_duration': 63000, 'cleanup_duration': 0, 'end_time': 1686643041670, 'run_name': 'Untitled', 'run_page_url': 'https://dbc-9c390870-65ef.cloud.databricks.com/?o=4256138892007661#job/1073688612166577/run/20103644', 'run_type': 'SUBMIT_RUN', 'attempt_number': 0, 'format': 'SINGLE_TASK'}
```

More details in this Slack thread: https://astronomer.slack.com/archives/C059004990C/p1686738542418129

Previously this was an intermittent failure, but we are now seeing the error on every run, without a single success.
The failure occurs when the Databricks cluster cannot configure the AWS credentials it needs to access S3.
We previously worked around it by creating a new Databricks cluster with an updated runtime version, which ran fine for a while,
but that no longer helps.
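If the root cause is indeed the cluster failing to pick up AWS credentials, one way to confirm that independently of the generated job is a quick check from a Databricks notebook. This is only a sketch: the fs.s3a.* keys are the standard Hadoop S3A properties, the secret scope, key names, and bucket are hypothetical, and how astro-sdk itself provisions credentials may differ.

```python
# Sanity check from a Databricks notebook: can this cluster read the bucket
# at all? The secret scope, key names, and bucket name are hypothetical.
spark.conf.set("fs.s3a.access.key", dbutils.secrets.get("aws", "access_key"))
spark.conf.set("fs.s3a.secret.key", dbutils.secrets.get("aws", "secret_key"))

# If credentials are wired up correctly this listing succeeds; a 403 here
# would reproduce the same access failure the load_file job hits.
display(dbutils.fs.ls("s3a://my-bucket/"))
```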