astronomer/astro-provider-databricks

Duplicate Job Creation in Databricks During Airflow DAG Runs


Issue

Our teams at HealthPartners are hitting a recurring issue: each run of an Airflow DAG creates a new Databricks job, even though the job already exists in the workspace.

This is most likely caused by the Databricks Jobs REST API returning at most 20 jobs per request by default. When a workspace contains more than 20 jobs, follow-up requests using the 'next_page_token' from each response are required to fetch the complete job list. Without that pagination, any job beyond the first page is never found, so the provider assumes it does not exist and creates it again on every DAG run.
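For reference, here is a minimal sketch of the pagination loop the Jobs API 2.1 list endpoint expects. It calls the REST endpoint directly with `requests`; the `list_all_jobs` helper, `host`, and `token` names are illustrative and not part of the provider:

```python
import requests


def list_all_jobs(host: str, token: str) -> list:
    """Fetch every job in the workspace, following 'next_page_token'
    across pages instead of stopping at the default page of 20."""
    jobs, page_token = [], None
    while True:
        params = {"limit": 20}
        if page_token:
            params["page_token"] = page_token
        resp = requests.get(
            f"{host}/api/2.1/jobs/list",
            headers={"Authorization": f"Bearer {token}"},
            params=params,
        )
        resp.raise_for_status()
        payload = resp.json()
        jobs.extend(payload.get("jobs", []))
        page_token = payload.get("next_page_token")
        if not page_token:  # last page reached
            return jobs
```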

Proposed Solution

In the "_get_job_by_name" function in operators/workflow.py:

  • pass the job_name parameter directly to the jobs_api.list_jobs() method to leverage the API's built-in name filter. This is more efficient than fetching an exhaustive job list and filtering client-side for the specific job (see the sketch below).
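
A hedged sketch of what the lookup could look like. Because the exact jobs_api.list_jobs() signature depends on the installed client version, this version calls the Jobs API 2.1 'name' query parameter directly; the function shape and return convention are assumptions mirroring the proposal, not the provider's actual code:

```python
from typing import Optional

import requests


def _get_job_by_name(job_name: str, host: str, token: str) -> Optional[dict]:
    """Look up a job by exact name via the Jobs API 2.1 'name'
    query parameter, instead of listing everything and filtering."""
    resp = requests.get(
        f"{host}/api/2.1/jobs/list",
        headers={"Authorization": f"Bearer {token}"},
        params={"name": job_name},
    )
    resp.raise_for_status()
    for job in resp.json().get("jobs", []):
        # The endpoint already filters on exact name; this check is defensive.
        if job.get("settings", {}).get("name") == job_name:
            return job
    return None  # caller can then create the job
```

Either approach (full pagination or the name filter) would stop the duplicate creation; the name filter additionally avoids paginating at all when only one job is needed.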