airflow-helm/charts

liveness probe queries causing high database IO usage on airflow 2.5.3

bitsofdave opened this issue · 1 comment

Chart Version

8.7.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.17-eks-5e0fdde", GitCommit:"221f136737816ce068aa975f082cd0bf81c7ad5f", GitTreeState:"clean", BuildDate:"2024-01-02T20:35:57Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.6"}

Description

This change to the liveness probes, made in support of airflow 2.6.0, is causing high database IO usage when running airflow 2.5.3.

From what I can tell, the part of the query that is causing poor performance is this change:

-                          .query(SchedulerJob) \
+                          .query(Job) \
+                          .filter_by(job_type=SchedulerJobRunner.job_type) 

I believe the issue is that the SchedulerJob class in 2.5.3 doesn't set a value for job_type.

The new query doesn't actually filter by job_type (notice the filter clause is the tautology job.job_type = job.job_type, which matches every row):

SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
FROM job 
WHERE job.job_type = job.job_type AND job.hostname = %(hostname_1)s ORDER BY job.latest_heartbeat DESC 
 LIMIT %(param_1)s
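The mechanism can be sketched in plain Python (these are hypothetical stand-ins for the real SQLAlchemy models, not Airflow code): because SchedulerJob in 2.5.3 sets no job_type value, an attribute lookup on the class falls through to the inherited Column descriptor, so filter_by receives the column itself rather than a string.

```python
class Column:
    """Stand-in for a SQLAlchemy Column; carries only its SQL name."""
    def __init__(self, name):
        self.name = name

class Job:
    job_type = Column("job.job_type")  # base table column

class SchedulerJob(Job):
    pass  # as in airflow 2.5.3: no job_type value set on the subclass

def render_filter(value):
    """Stand-in for filter_by(job_type=value): render the WHERE clause."""
    if isinstance(value, Column):
        # The column compared against itself -- a tautology that matches
        # every row and gives the planner nothing to index on.
        return f"job.job_type = {value.name}"
    return f"job.job_type = '{value}'"

# 2.6.x: job_type resolves to a plain string, so the filter is selective.
print(render_filter("SchedulerJob"))         # job.job_type = 'SchedulerJob'

# 2.5.3: the lookup resolves to the inherited column object instead.
print(render_filter(SchedulerJob.job_type))  # job.job_type = job.job_type
```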

The old query, used in helm chart v8.7.0 and earlier, filtered by job_type correctly:

SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
FROM job 
WHERE job.hostname = %(hostname_1)s AND job.job_type IN (__[POSTCOMPILE_job_type_1]) ORDER BY job.latest_heartbeat DESC 
 LIMIT %(param_1)s

While the new query does return the correct result, its inefficiency can cause database IO to be throttled, resulting in liveness probe failures due to query timeouts.
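The planner-level difference is reproducible with stdlib SQLite as a hypothetical stand-in for the real database, assuming an index whose leading column is job_type (the table and index names below are illustrative, not Airflow's actual schema): the tautological predicate forces a full table scan, while the old IN-list predicate can use the index.

```python
import sqlite3

# Toy schema: a job table with an index led by job_type (assumption for
# illustration; the real database's indexes may differ).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job (
        id INTEGER PRIMARY KEY,
        job_type TEXT,
        hostname TEXT,
        latest_heartbeat TEXT
    )
""")
conn.execute("CREATE INDEX idx_job_type ON job (job_type, hostname)")

def plan(sql):
    """Return SQLite's query plan details for the given statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " / ".join(row[-1] for row in rows)

# New (v8.7.1) shape: job_type compared to itself -- the index cannot help,
# so the plan is a full SCAN of the job table.
print(plan("SELECT id FROM job WHERE job.job_type = job.job_type "
           "AND hostname = 'h' ORDER BY latest_heartbeat DESC LIMIT 1"))

# Old (v8.7.0) shape: a real predicate on job_type -- the plan is a
# SEARCH using idx_job_type.
print(plan("SELECT id FROM job WHERE hostname = 'h' "
           "AND job_type IN ('SchedulerJob') "
           "ORDER BY latest_heartbeat DESC LIMIT 1"))
```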

Relevant Logs

Warning  Unhealthy         7m29s (x5 over 10m)    kubelet             Liveness probe failed: The SchedulerJob (id=97704647) for hostname 'airflow-scheduler-8586fc7d9f-br7vb' is not alive

Custom Helm Values

No response