liveness probe queries causing high database IO usage on airflow 2.5.3
bitsofdave opened this issue · 1 comment
bitsofdave commented
Checks
- I have checked for existing issues.
- This report is about the User-Community Airflow Helm Chart.
Chart Version
8.7.1
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.17-eks-5e0fdde", GitCommit:"221f136737816ce068aa975f082cd0bf81c7ad5f", GitTreeState:"clean", BuildDate:"2024-01-02T20:35:57Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
Helm Version
version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.6"}
Description
This change to the liveness probes, made in support of Airflow 2.6.0, is causing high database IO usage when running Airflow 2.5.3.
From what I can tell, the part of the query that is causing poor performance is this change:
- .query(SchedulerJob) \
+ .query(Job) \
+ .filter_by(job_type=SchedulerJobRunner.job_type)
I believe the issue is that the SchedulerJob class in 2.5.3 doesn't set a value for job_type.
The new query doesn't correctly filter by job_type (notice the filter clause is job.job_type = job.job_type):
SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname
FROM job
WHERE job.job_type = job.job_type AND job.hostname = %(hostname_1)s ORDER BY job.latest_heartbeat DESC
LIMIT %(param_1)s
The old query, used in helm chart v8.7.0 and earlier, correctly filtered by job_type:
SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname
FROM job
WHERE job.hostname = %(hostname_1)s AND job.job_type IN (__[POSTCOMPILE_job_type_1]) ORDER BY job.latest_heartbeat DESC
LIMIT %(param_1)s
While the new query does work, its inefficiency can cause database IO to be throttled, resulting in liveness probe failures due to query timeouts.
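The tautological predicate can be reproduced with a minimal SQLAlchemy sketch. The `Job` model and `SchedulerJobRunner` stand-in below are simplified assumptions for illustration, not the actual Airflow classes; the point is only that when a class's `job_type` attribute resolves to the mapped column itself (rather than a concrete string), `filter_by` compares the column to itself:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Job(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    job_type = Column(String)
    hostname = Column(String)

# Hypothetical stand-in for the 2.5.3 situation described above: the
# class never assigns a concrete job_type string, so the attribute
# resolves to the mapped column itself.
class SchedulerJobRunner:
    job_type = Job.job_type

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # New-style probe query: filter_by compares the column to itself,
    # emitting the tautology "job.job_type = job.job_type".
    broken = session.query(Job).filter_by(job_type=SchedulerJobRunner.job_type)
    print(str(broken))

    # Filtering on a literal string yields a selective predicate instead.
    fixed = session.query(Job).filter(Job.job_type == "SchedulerJob")
    print(str(fixed))
```

Printing the two query objects shows the first compiles to `WHERE job.job_type = job.job_type`, which matches every row, while the second compiles to a bound-parameter comparison the database can actually use to narrow the scan.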
Relevant Logs
Warning Unhealthy 7m29s (x5 over 10m) kubelet Liveness probe failed: The SchedulerJob (id=97704647) for hostname 'airflow-scheduler-8586fc7d9f-br7vb' is not alive
Custom Helm Values
No response