Liveness probe fails on GCP
maciejmochol opened this issue · 2 comments
Checks
- I have checked for existing issues.
- This report is about the User-Community Airflow Helm Chart.
Chart Version
8.8.0
Kubernetes Version
Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.15-gke.1700
WARNING: version difference between client (1.28) and server (1.24) exceeds the supported minor version skew of +/-1
Helm Version
version.BuildInfo{Version:"v3.10.1", GitCommit:"9f88ccb6aee40b9a0535fcc7efea6055e1ef72c9", GitTreeState:"clean", GoVersion:"go1.18.7"}
Description
Dear Community,
Our Airflow workers started crashing - it seems GCP kills them due to a failed liveness probe.
The 'kubectl describe' output contains an error related to code from the Helm chart.
Partial output of 'kubectl describe pod airflow-worker-0':
ContainersReady True
PodScheduled True
Volumes:
logs-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-9cnpq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: kubernetes.io/arch=amd64:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 6m59s kubelet Container airflow-worker failed liveness probe, will be restarted
Warning FailedPreStopHook 6m7s kubelet Exec lifecycle hook ([timeout 540s python -Wignore -c import os
import time
import subprocess
from celery import Celery
from celery.app.control import Inspect
from typing import List
def run_command(cmd: List[str]) -> str:
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
output, error = process.communicate()
if error is not None:
raise Exception(error)
else:
return output.decode(encoding="utf-8")
broker_url = run_command(["bash", "-c", "eval $AIRFLOW__CELERY__BROKER_URL_CMD"])
local_celery_host = f"celery@{os.environ['HOSTNAME']}"
app = Celery(broker=broker_url)
# prevent the worker accepting new tasks
print(f"canceling celery consumer for {local_celery_host}...")
app.control.cancel_consumer("default", destination=[local_celery_host])
# wait until the worker finishes its current tasks
i = Inspect(app=app, destination=[local_celery_host])
active_tasks = i.active()[local_celery_host]
while len(active_tasks) > 0:
print(f"waiting [10 sec] for remaining tasks to finish: {[task.get('name') for task in active_tasks]}")
time.sleep(10)
active_tasks = i.active()[local_celery_host]
]) for Container "airflow-worker" in Pod "airflow-worker-0_accrualengine-dev(b43f8529-8578-403a-9771-56e33dbf85ee)" failed - error: command 'timeout 540s python -Wignore -c <same script as above>' exited with 137: , message: ""
Warning Unhealthy 2m39s (x2 over 9m39s) kubelet Liveness probe failed:
Warning Unhealthy 2m39s kubelet Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "3b236681e32284282771ca3f8e45d1d577d00d06a0e84ede3ff7d489e1c98d72": OCI runtime exec failed: exec failed: unable to start container process: error executing setns process: exit status 1: unknown
Warning BackOff 2m29s (x2 over 2m37s) kubelet Back-off restarting failed container
Normal Pulled 2m17s (x4 over 75m) kubelet Container image "europe-central2-docker.pkg.dev/ae-dev-344412/aedev-repo/airflow:20230911075617" already present on machine
Normal Created 2m17s (x4 over 75m) kubelet Created container airflow-worker
Normal Started 2m15s (x4 over 75m) kubelet Started container airflow-worker
Warning Unhealthy 32s (x11 over 74m) kubelet Liveness probe failed: Traceback (most recent call last):
File "<string>", line 23, in <module>
TypeError: argument of type 'NoneType' is not iterable
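(For context on that final event: the TypeError comes from the probe script indexing the result of Celery's Inspect.active(), which returns None when no worker replies. A None-safe lookup, sketched here without the Celery dependency - active_tasks_for and its reply argument are illustrative names standing in for i.active() and its return value, not code from the chart:)

```python
from typing import Optional


def active_tasks_for(reply: Optional[dict], host: str) -> list:
    # "reply" stands in for the value of Inspect.active(), which is
    # None when neither the broker nor the worker answered in time.
    if reply is None:
        # fail the probe explicitly instead of crashing with a TypeError
        raise SystemExit(f"no active() reply for {host}; broker or worker unreachable")
    # a replying cluster may still have no entry (or no tasks) for this host
    return reply.get(host) or []
```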
Relevant Logs
No response
Custom Helm Values
No response
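(An aside on the pre-stop hook script quoted above: its run_command helper checks the second value returned by communicate(), but with Popen's defaults - no stderr=subprocess.PIPE - that value is always None, so failures can pass silently. A sketch of an equivalent helper that actually surfaces errors; the name run_command_checked is illustrative, not from the chart:)

```python
import subprocess
from typing import List


def run_command_checked(cmd: List[str]) -> str:
    # capture both streams and check the exit code; Popen.communicate()
    # only returns a stderr value when stderr=subprocess.PIPE was requested
    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode("utf-8"))
    return result.stdout.decode("utf-8")
```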
@maciejmochol The actual error is not related to the FailedPreStopHook
event. That is simply the pre-stop hook you enabled with ...gracefullTermination
waiting 600 seconds for tasks to finish before the pod is killed (which in this case was not enough time for whatever tasks were running to finish).
The actual liveness probe is failing with the error OCI runtime exec failed: exec failed: unable to start container process: error executing setns process: exit status 1: unknown
, which suggests something is wrong with that Kubernetes Node in your cluster, preventing the kubelet from executing commands in the container.
Questions:
- Are you seeing this on all of your Nodes?
- Are you able to use
kubectl exec -it ...
to manually run commands in Pods? (Especially check Pods on the Nodes that were running these workers.)
I would try:
- Restarting the Kubernetes Nodes which are showing this problem
- Checking to see if your Nodes are massively over-provisioned (in terms of RAM/CPU usage)
Hello Mathew,
Thanks for your prompt answer. Yes, we are seeing this on all nodes (we use GCP Autopilot mode), and yes, I was able to get a shell inside the worker pod(s).
I don't think our Nodes are over-provisioned, as I assume that should not happen in GCP Autopilot. But it seems you may be right on point (2) - after we disabled the liveness probe, the Pod was still getting killed by the OOM killer (exit code 137). So we increased the RAM for the pod, and the workers are not crashing anymore. After some time we will re-enable the liveness probe to see if it all works.
But this OOM is actually worrying me. I had previously reduced AIRFLOW__CELERY__WORKER_CONCURRENCY down to 5 (from 16, which I think was the default?), and it worked fine until now, when all of a sudden the workers started consuming more memory than usual. Is there a way to prevent the Python processes from consuming too much memory?
That said, I believe this is outside the scope of the bug I raised earlier - thank you for the help.
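(One generic, stdlib-level answer to the memory question above: cap the worker process's address space, so runaway allocations raise MemoryError inside Python instead of tripping the kernel OOM killer. Celery also has its own worker_max_memory_per_child setting for recycling worker processes. The sketch below shows only the generic resource-limit technique, on Linux; the function name is illustrative:)

```python
import resource


def cap_address_space(max_bytes: int) -> None:
    # lower the soft RLIMIT_AS cap for this process; allocations beyond
    # it fail with MemoryError rather than invoking the OOM killer
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
```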