airflow-helm/charts

Liveness probe fails on GCP

maciejmochol opened this issue · 2 comments

Chart Version

8.8.0

Kubernetes Version

Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.15-gke.1700
WARNING: version difference between client (1.28) and server (1.24) exceeds the supported minor version skew of +/-1

Helm Version

version.BuildInfo{Version:"v3.10.1", GitCommit:"9f88ccb6aee40b9a0535fcc7efea6055e1ef72c9", GitTreeState:"clean", GoVersion:"go1.18.7"}

Description

Dear Community,

Our Airflow workers started crashing - it seems GKE kills them due to a failed liveness probe.
The output of 'kubectl describe pod' shows an error pointing at this line from the Helm chart's scripts:

print(f"canceling celery consumer for {local_celery_host}...")

Partial output of 'kubectl describe pod airflow-worker-0':

  ContainersReady   True
  PodScheduled      True
Volumes:
  logs-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-9cnpq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 kubernetes.io/arch=amd64:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age    From     Message
  ----     ------             ----   ----     -------
  Normal   Killing            6m59s  kubelet  Container airflow-worker failed liveness probe, will be restarted
  Warning  FailedPreStopHook  6m7s   kubelet  Exec lifecycle hook ([timeout 540s python -Wignore -c import os
import time
import subprocess
from celery import Celery
from celery.app.control import Inspect
from typing import List

def run_command(cmd: List[str]) -> str:
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    output, error = process.communicate()
    if error is not None:
        raise Exception(error)
    else:
        return output.decode(encoding="utf-8")

broker_url = run_command(["bash", "-c", "eval $AIRFLOW__CELERY__BROKER_URL_CMD"])
local_celery_host = f"celery@{os.environ['HOSTNAME']}"
app = Celery(broker=broker_url)

# prevent the worker accepting new tasks
print(f"canceling celery consumer for {local_celery_host}...")
app.control.cancel_consumer("default", destination=[local_celery_host])

# wait until the worker finishes its current tasks
i = Inspect(app=app, destination=[local_celery_host])
active_tasks = i.active()[local_celery_host]
while len(active_tasks) > 0:
    print(f"waiting [10 sec] for remaining tasks to finish: {[task.get('name') for task in active_tasks]}")
    time.sleep(10)
    active_tasks = i.active()[local_celery_host]
]) for Container "airflow-worker" in Pod "airflow-worker-0_accrualengine-dev(b43f8529-8578-403a-9771-56e33dbf85ee)" failed - error: command 'timeout 540s python -Wignore -c import os
import time
import subprocess
from celery import Celery
from celery.app.control import Inspect
from typing import List

def run_command(cmd: List[str]) -> str:
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    output, error = process.communicate()
    if error is not None:
        raise Exception(error)
    else:
        return output.decode(encoding="utf-8")

broker_url = run_command(["bash", "-c", "eval $AIRFLOW__CELERY__BROKER_URL_CMD"])
local_celery_host = f"celery@{os.environ['HOSTNAME']}"
app = Celery(broker=broker_url)

# prevent the worker accepting new tasks
print(f"canceling celery consumer for {local_celery_host}...")
app.control.cancel_consumer("default", destination=[local_celery_host])

# wait until the worker finishes its current tasks
i = Inspect(app=app, destination=[local_celery_host])
active_tasks = i.active()[local_celery_host]
while len(active_tasks) > 0:
    print(f"waiting [10 sec] for remaining tasks to finish: {[task.get('name') for task in active_tasks]}")
    time.sleep(10)
    active_tasks = i.active()[local_celery_host]
' exited with 137: , message: ""
  Warning  Unhealthy  2m39s (x2 over 9m39s)  kubelet  Liveness probe failed:
  Warning  Unhealthy  2m39s                  kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "3b236681e32284282771ca3f8e45d1d577d00d06a0e84ede3ff7d489e1c98d72": OCI runtime exec failed: exec failed: unable to start container process: error executing setns process: exit status 1: unknown
  Warning  BackOff    2m29s (x2 over 2m37s)  kubelet  Back-off restarting failed container
  Normal   Pulled     2m17s (x4 over 75m)    kubelet  Container image "europe-central2-docker.pkg.dev/ae-dev-344412/aedev-repo/airflow:20230911075617" already present on machine
  Normal   Created    2m17s (x4 over 75m)    kubelet  Created container airflow-worker
  Normal   Started    2m15s (x4 over 75m)    kubelet  Started container airflow-worker
  Warning  Unhealthy  32s (x11 over 74m)     kubelet  Liveness probe failed: Traceback (most recent call last):
  File "<string>", line 23, in <module>
TypeError: argument of type 'NoneType' is not iterable

Relevant Logs

No response

Custom Helm Values

No response

@maciejmochol The actual error is not related to the FailedPreStopHook event. That is simply the pre-stop hook you enabled with ...gracefullTermination, which waits 600 seconds for running tasks to finish before killing the Pod (and in this case, that was not enough time for whatever tasks were running to finish).

The actual liveness probe is failing with the error OCI runtime exec failed: exec failed: unable to start container process: error executing setns process: exit status 1: unknown, which looks like something is wrong with that Kubernetes Node in your cluster, preventing the kubelet from executing commands in the container.
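
As a side note, the other liveness failure in your events, the TypeError: argument of type 'NoneType' is not iterable traceback, is what you would typically see when the Celery worker does not answer an inspect/ping at all, since Celery returns None when no worker replies - for example because the worker process is down or was just OOM-killed. The chart's actual probe script is not shown in your output, so the following is only a minimal sketch of that failure mode, assuming the probe checks membership in the ping response:

import os
from celery import Celery
from celery.app.control import Inspect

# the real probe resolves the broker URL by evaluating $AIRFLOW__CELERY__BROKER_URL_CMD
# (as in the pre-stop script above); a plain env var is assumed here to keep the sketch short
broker_url = os.environ["AIRFLOW__CELERY__BROKER_URL"]
local_celery_host = f"celery@{os.environ['HOSTNAME']}"

app = Celery(broker=broker_url)
ping_responses = Inspect(app=app, destination=[local_celery_host], timeout=5).ping()

# Celery's Inspect.ping() returns None when no worker replies (for example, when the
# worker process has just been OOM-killed), so the membership test below raises:
#   TypeError: argument of type 'NoneType' is not iterable
if local_celery_host not in ping_responses:
    raise SystemExit(f"celery worker {local_celery_host} did not respond to ping")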

Questions:

  • Are you seeing this on all of your Nodes?
  • Are you able to use kubectl exec -it ... to manually run commands in Pods? (Especially check with Pods on the Nodes that were running these workers)

I would try:

  1. Restarting the Kubernetes Nodes which are showing this problem
  2. Checking to see if your Nodes are massively over-provisioned (in terms of RAM/CPU usage)
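
For point (2), one rough way to compare the memory requests of the Pods scheduled on each Node against that Node's allocatable memory is a short script using the official kubernetes Python client (this is only a sketch, assuming the kubernetes package is installed and a kubeconfig is available; kubectl top nodes gives a quicker view of actual usage if metrics-server is running):

from collections import defaultdict

from kubernetes import client, config
from kubernetes.utils import parse_quantity

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# sum the memory requests of every scheduled Pod, grouped by the Node it runs on
requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces().items:
    node_name = pod.spec.node_name
    if node_name is None:
        continue
    for container in pod.spec.containers:
        requests = (container.resources.requests or {}) if container.resources else {}
        if "memory" in requests:
            requested[node_name] += parse_quantity(requests["memory"])

# compare against each Node's allocatable memory
for node in v1.list_node().items:
    allocatable = parse_quantity(node.status.allocatable["memory"])
    fraction = requested[node.metadata.name] / allocatable
    print(f"{node.metadata.name}: memory requests are {fraction:.0%} of allocatable")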

Hello Mathew,

Thanks for your prompt answer. Yes, we are seeing this on all nodes (we use GCP Autopilot mode), and yes, I was able to get a shell inside the worker pod(s).

I don't think our Nodes are over-provisioned, as I assume that should not happen with GCP Autopilot. But it seems you may be right about point (2) - after we disabled the liveness probe, the Pod was getting killed by the OOMKiller beast (aka error 137). So we increased the RAM for the pod and the workers are not crashing anymore. After some time we will re-enable the liveness probe to see if it all works.

But this OOM is actually worrying me. I had previously reduced AIRFLOW__CELERY__WORKER_CONCURRENCY to 5 (from 16, which I think was the default?) and it worked fine until now, when all of a sudden the workers started consuming more memory than usual. Is there a way to prevent the Python processes from consuming too much memory?

That said, I believe this is outside the scope of the bug I raised earlier - thank you for your help.