airflow-helm/charts

Scheduler crashed with error "PermissionError: [Errno 13] Permission denied" even with extraInitContainers and PVC.

Chart Version

1.13.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:51:24Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"28+", GitVersion:"v1.28.6-eks-508b6b3", GitCommit:"25a726351cee8ee6facce01af4214605e089d5da", GitTreeState:"clean", BuildDate:"2024-01-29T20:58:56Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}

Custom docker file

FROM apache/airflow:2.9.0

RUN pip install apache-airflow-providers-tableau snowflake-connector-python snowflake-sqlalchemy apache-airflow-providers-snowflake pendulum

Helm Version

version.BuildInfo{Version:"v3.11.1", GitCommit:"293b50c65d4d56187cd4e2f390f0ada46b4c4737", GitTreeState:"clean", GoVersion:"go1.19.5"}

Description

Hi,

I am using a custom Docker image based on the official image, with the latest version of Airflow (2.9.0).
I'm able to deploy Airflow on AWS EKS using the official Helm chart.

After a while, however, the scheduler keeps restarting in a loop. I found that the scheduler-log-groomer container was missing permissions on the '/opt/airflow/logs' folder.

I then added extraInitContainers (spec attached below) to the scheduler section of my values.yaml.

After upgrading the chart, though, I still see scheduler errors in the logs. Now the liveness probe cannot access the "/opt/airflow/logs/scheduler" folder.
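
For readers following the ownership reasoning, here is a minimal sketch of the classic POSIX mode-bit check for creating an entry in a directory. It is an illustration only: the function name `may_create_in_dir` is hypothetical, UID 50000 and GID 0 are taken from the chown command in the values further down, and the UID and groups the probe process actually runs with are assumptions. ACLs, Linux capabilities and filesystem-specific behaviour are ignored.

```python
import stat


def may_create_in_dir(mode, owner_uid, owner_gid, proc_uid, proc_gids):
    """Classic POSIX check: creating an entry needs write + search on the directory.

    Hypothetical helper for illustration; it only looks at mode bits.
    """
    if proc_uid == 0:
        # root normally bypasses mode bits (unless capabilities were dropped)
        return True
    if proc_uid == owner_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in proc_gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed


if __name__ == "__main__":
    # Directory owned by 50000:0 (what the chown aims for), process runs as 50000
    print(may_create_in_dir(0o755, 50000, 0, 50000, {0}))
    # Directory still owned by root:root, process is 50000 with group 0, no group write bit
    print(may_create_in_dir(0o755, 0, 0, 50000, {0}))
    # Same ownership, but with the group write bit set
    print(may_create_in_dir(0o775, 0, 0, 50000, {0}))
```

The point of the sketch is that a chown to 50000:0 only helps a process that runs as UID 50000, or one that has GID 0 while the directory also carries group write and search bits. A subdirectory created later by a different user would fail the same check again.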

Relevant Logs

Name:             dna-airflow-scheduler-5cc8cfd8f6-hx2bl
Namespace:        airflow
Priority:         0
Service Account:  dna-airflow-scheduler
Node:             ip-172-18-231-89.us-west-2.compute.internal/172.18.231.89
Start Time:       Fri, 12 Apr 2024 22:49:41 +0200
Labels:           component=scheduler
                  pod-template-hash=5cc8cfd8f6
                  release=dna-airflow
                  tier=airflow
Annotations:      checksum/airflow-config: 6fb676fa1295f9e8afd5408033a62ecaf465ddd5339dff805ffbcf8e653848dc
                  checksum/extra-configmaps: e862ea47e13e634cf17d476323784fa27dac20015550c230953b526182f5cac8
                  checksum/extra-secrets: e9582fdd622296c976cbc10a5ba7d6702c28a24fe80795ea5b84ba443a56c827
                  checksum/metadata-secret: b2fe937560e9635aeb01fce9100c2f836c5880f81c802565ce95fbcc8a56da4c
                  checksum/pgbouncer-config-secret: 1dae2adc757473469686d37449d076b0c82404f61413b58ae68b3c5e99527688
                  checksum/result-backend-secret: 98a68f230007cfa8f5d3792e1aff843a76b0686409e4a46ab2f092f6865a1b71
                  cluster-autoscaler.kubernetes.io/safe-to-evict: true
Status:           Running
IP:               100.65.129.143
IPs:
  IP:           100.65.129.143
Controlled By:  ReplicaSet/dna-airflow-scheduler-5cc8cfd8f6
Init Containers:
  wait-for-airflow-migrations:
    Container ID:  containerd://2bce7a37555105485909d706f8d5264156f87a598dc9e14a0272db55cb5f328c
    Image:         ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235
    Image ID:      docker.io/ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235
    Port:          <none>
    Host Port:     <none>
    Args:
      airflow
      db
      check-migrations
      --migration-wait-timeout=60
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 12 Apr 2024 22:49:44 +0200
      Finished:     Fri, 12 Apr 2024 22:50:17 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      AIRFLOW__WEBSERVER__EXPOSE_CONFIG:    True
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'dna-airflow-fernet-key'>  Optional: false
      AIRFLOW_HOME:                         /opt/airflow
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-rds-db'>                              Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-rds-db'>                              Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-rds-db'>                              Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'dna-airflow-webserver-secret-key'>  Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t7tng (ro)
  git-sync-init:
    Container ID:   containerd://6fab71c316323da5dd38ab10af8c70ca0fd6590152f985fb3ed152987631dafd
    Image:          registry.k8s.io/git-sync/git-sync:v4.1.0
    Image ID:       registry.k8s.io/git-sync/git-sync@sha256:fd9722fd02e3a559fd6bb4427417c53892068f588fc8372aa553fbf2f05f9902
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 12 Apr 2024 22:50:18 +0200
      Finished:     Fri, 12 Apr 2024 22:50:21 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      GIT_SYNC_USERNAME:           <set to the key 'GIT_SYNC_USERNAME' in secret 'git-credentials'>  Optional: false
      GITSYNC_USERNAME:            <set to the key 'GITSYNC_USERNAME' in secret 'git-credentials'>   Optional: false
      GIT_SYNC_PASSWORD:           <set to the key 'GIT_SYNC_PASSWORD' in secret 'git-credentials'>  Optional: false
      GITSYNC_PASSWORD:            <set to the key 'GITSYNC_PASSWORD' in secret 'git-credentials'>   Optional: false
      GIT_SYNC_REV:                HEAD
      GITSYNC_REF:                 main
      GIT_SYNC_BRANCH:             main
      GIT_SYNC_REPO:               https://github.com/.../airflow-bizapps-dev.git
      GITSYNC_REPO:                https://github.com/.../airflow-bizapps-dev.git
      GIT_SYNC_DEPTH:              1
      GITSYNC_DEPTH:               1
      GIT_SYNC_ROOT:               /git
      GITSYNC_ROOT:                /git
      GIT_SYNC_DEST:               repo
      GITSYNC_LINK:                repo
      GIT_SYNC_ADD_USER:           true
      GITSYNC_ADD_USER:            true
      GITSYNC_PERIOD:              5s
      GIT_SYNC_MAX_SYNC_FAILURES:  0
      GITSYNC_MAX_FAILURES:        0
      GIT_SYNC_ONE_TIME:           true
      GITSYNC_ONE_TIME:            true
    Mounts:
      /git from dags (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t7tng (ro)
  fix-volume-logs-permissions:
    Container ID:  containerd://1cae61ad7f4eab50dd360f064b8645fd6c7fde7696e2346e2098d5d3e81c4879
    Image:         busybox
    Image ID:      docker.io/library/busybox@sha256:c3839dd800b9eb7603340509769c43e146a74c63dca3045a8e7dc8ee07e53966
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chown -R 50000:0 /opt/airflow/logs/
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 12 Apr 2024 22:50:23 +0200
      Finished:     Fri, 12 Apr 2024 22:50:23 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /opt/airflow/logs/ from logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t7tng (ro)
Containers:
  scheduler:
    Container ID:  containerd://43eff9161606c18596c87a4504de3e782367cf87ec63ecc0650f72db32d75032
    Image:         ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235
    Image ID:      docker.io/ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      -c
      exec airflow scheduler
    State:          Running
      Started:      Fri, 12 Apr 2024 23:32:46 +0200
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 12 Apr 2024 23:23:46 +0200
      Finished:     Fri, 12 Apr 2024 23:32:45 +0200
    Ready:          True
    Restart Count:  6
    Liveness:       exec [sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local
] delay=10s timeout=20s period=60s #success=1 #failure=5
    Startup:  exec [sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local
] delay=0s timeout=20s period=10s #success=1 #failure=6
    Environment:
      AIRFLOW__WEBSERVER__EXPOSE_CONFIG:    True
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'dna-airflow-fernet-key'>  Optional: false
      AIRFLOW_HOME:                         /opt/airflow
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-rds-db'>                              Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-rds-db'>                              Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-rds-db'>                              Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'dna-airflow-webserver-secret-key'>  Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
      /opt/airflow/dags from dags (ro)
      /opt/airflow/logs from logs (rw)
      /opt/airflow/pod_templates/pod_template_file.yaml from config (ro,path="pod_template_file.yaml")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t7tng (ro)
  git-sync:
    Container ID:   containerd://efae96265bca88f581c3e92c5f138f3ae7ca4db805aa0797016ccb375ad4b90e
    Image:          registry.k8s.io/git-sync/git-sync:v4.1.0
    Image ID:       registry.k8s.io/git-sync/git-sync@sha256:fd9722fd02e3a559fd6bb4427417c53892068f588fc8372aa553fbf2f05f9902
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 12 Apr 2024 22:50:23 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      GIT_SYNC_USERNAME:           <set to the key 'GIT_SYNC_USERNAME' in secret 'git-credentials'>  Optional: false
      GITSYNC_USERNAME:            <set to the key 'GITSYNC_USERNAME' in secret 'git-credentials'>   Optional: false
      GIT_SYNC_PASSWORD:           <set to the key 'GIT_SYNC_PASSWORD' in secret 'git-credentials'>  Optional: false
      GITSYNC_PASSWORD:            <set to the key 'GITSYNC_PASSWORD' in secret 'git-credentials'>   Optional: false
      GIT_SYNC_REV:                HEAD
      GITSYNC_REF:                 main
      GIT_SYNC_BRANCH:             main
      GIT_SYNC_REPO:               https://github.com/.../airflow-bizapps-dev.git
      GITSYNC_REPO:                https://github.com/.../airflow-bizapps-dev.git
      GIT_SYNC_DEPTH:              1
      GITSYNC_DEPTH:               1
      GIT_SYNC_ROOT:               /git
      GITSYNC_ROOT:                /git
      GIT_SYNC_DEST:               repo
      GITSYNC_LINK:                repo
      GIT_SYNC_ADD_USER:           true
      GITSYNC_ADD_USER:            true
      GITSYNC_PERIOD:              5s
      GIT_SYNC_MAX_SYNC_FAILURES:  0
      GITSYNC_MAX_FAILURES:        0
    Mounts:
      /git from dags (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t7tng (ro)
  scheduler-log-groomer:
    Container ID:  containerd://dc0b21c1c37f7aed6a7ff67e812ff95a125297ddae1975f2365511cf9b0a3cbc
    Image:         ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235
    Image ID:      docker.io/ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      /clean-logs
    State:          Running
      Started:      Fri, 12 Apr 2024 22:50:24 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      AIRFLOW__LOG_RETENTION_DAYS:  15
      AIRFLOW_HOME:                 /opt/airflow
    Mounts:
      /opt/airflow/logs from logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t7tng (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dna-airflow-config
    Optional:  false
  dags:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  logs:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  dna-airflow-logs
    ReadOnly:   false
  kube-api-access-t7tng:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  48m   default-scheduler  Successfully assigned airflow/dna-airflow-scheduler-5cc8cfd8f6-hx2bl to ip-172-18-231-89.us-west-2.compute.internal
  Normal   Pulling    48m   kubelet            Pulling image "ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235"
  Normal   Pulled     48m   kubelet            Successfully pulled image "ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235" in 1.918s (1.918s including waiting)
  Normal   Created    48m   kubelet            Created container wait-for-airflow-migrations
  Normal   Started    48m   kubelet            Started container wait-for-airflow-migrations
  Normal   Pulled     48m   kubelet            Container image "registry.k8s.io/git-sync/git-sync:v4.1.0" already present on machine
  Normal   Created    48m   kubelet            Created container git-sync-init
  Normal   Started    48m   kubelet            Started container git-sync-init
  Normal   Pulling    48m   kubelet            Pulling image "busybox"
  Normal   Pulled     48m   kubelet            Successfully pulled image "busybox" in 619ms (619ms including waiting)
  Normal   Created    48m   kubelet            Created container fix-volume-logs-permissions
  Normal   Pulled     48m   kubelet            Container image "ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235" already present on machine
  Normal   Started    48m   kubelet            Started container fix-volume-logs-permissions
  Normal   Created    48m   kubelet            Created container scheduler
  Normal   Started    48m   kubelet            Started container scheduler
  Normal   Pulled     48m   kubelet            Container image "registry.k8s.io/git-sync/git-sync:v4.1.0" already present on machine
  Normal   Created    48m   kubelet            Created container git-sync
  Normal   Started    48m   kubelet            Started container git-sync
  Normal   Created    48m   kubelet            Created container scheduler-log-groomer
  Normal   Started    48m   kubelet            Started container scheduler-log-groomer
  Warning  Unhealthy  47m   kubelet            Startup probe failed: command "sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \\\nairflow jobs check --job-type SchedulerJob --local\n" timed out
  Warning  Unhealthy  47m   kubelet            Startup probe failed: /home/airflow/.local/lib/python3.12/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
No alive jobs found.
  Normal   Pulled     47m (x2 over 48m)   kubelet  Container image "ravilkhalilov/airflow-demo@sha256:2af0e928daca24e5b83e1ac4e8d701cf72d2c0de5f3f1e38937826218e860235" already present on machine
  Warning  Unhealthy  47m                 kubelet  Startup probe failed:
  Warning  Unhealthy  47m                 kubelet  Startup probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task cb4dafd9cf37d9ad90bd10a4d36ae47b7d3c3b714efd4a1022a971fce25ca6be not found: not found
  Warning  Unhealthy  49s (x26 over 45m)  kubelet  Liveness probe failed: /home/airflow/.local/lib/python3.12/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
Unable to load the config, contains a configuration error.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/pathlib.py", line 1311, in mkdir
    os.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: '/opt/airflow/logs/scheduler/2024-04-12'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/logging/config.py", line 581, in configure
    handler = self.configure_handler(handlers[name])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/config.py", line 848, in configure_handler
    result = factory(**kwargs)
             ^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/log/file_processor_handler.py", line 53, in __init__
    Path(self._get_log_directory()).mkdir(parents=True, exist_ok=True)
  File "/usr/local/lib/python3.12/pathlib.py", line 1320, in mkdir
    if not exist_ok or not self.is_dir():
                           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/pathlib.py", line 875, in is_dir
    return S_ISDIR(self.stat().st_mode)
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/pathlib.py", line 840, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/opt/airflow/logs/scheduler/2024-04-12'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/__init__.py", line 61, in <module>
    settings.initialize()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/settings.py", line 531, in initialize
    LOGGING_CLASS_PATH = configure_logging()
                         ^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/logging_config.py", line 74, in configure_logging
    raise e
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/logging_config.py", line 69, in configure_logging
    dictConfig(logging_config)
  File "/usr/local/lib/python3.12/logging/config.py", line 914, in dictConfig
    dictConfigClass(config).configure()
  File "/usr/local/lib/python3.12/logging/config.py", line 588, in configure
    raise ValueError('Unable to configure handler '
ValueError: Unable to configure handler 'processor'
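
The traceback shows `os.mkdir` and then `os.stat` both being denied on the dated directory, which suggests that one of its parent directories cannot be searched by the user running the probe. One way to narrow this down is to walk from the failing path up to the mount point and report owner, mode and access at each level. The sketch below is a hypothetical helper (`explain_path_access` is not part of Airflow), intended to be run inside the scheduler container as the same user as the probe. The default paths are the ones from the traceback and the volume mount above.

```python
import os
import stat
from pathlib import Path


def explain_path_access(target, stop_at="/"):
    """Report status, ownership and mode for `target` and each ancestor up to `stop_at`."""
    rows = []
    current = Path(target)
    stop = Path(stop_at)
    while True:
        row = {"path": str(current)}
        try:
            st = os.stat(current)
        except FileNotFoundError:
            row["status"] = "missing"
        except PermissionError:
            # A parent directory lacks the search (x) bit for this user
            row["status"] = "stat denied"
        except OSError as exc:
            row["status"] = f"error: {exc.strerror}"
        else:
            row.update(
                status="ok",
                uid=st.st_uid,
                gid=st.st_gid,
                mode=stat.filemode(st.st_mode),
                # os.access checks against the real UID/GID of this process
                writable=os.access(current, os.W_OK),
                searchable=os.access(current, os.X_OK),
            )
        rows.append(row)
        # Stop at the requested ancestor, or at the filesystem root
        if current == stop or current == current.parent:
            break
        current = current.parent
    return rows


if __name__ == "__main__":
    for row in explain_path_access(
        "/opt/airflow/logs/scheduler/2024-04-12", stop_at="/opt/airflow/logs"
    ):
        print(row)
```

Comparing the reported `uid`/`gid` with the output of `id` in the same container should show whether the directory was created by a different user than the one the probe runs as.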

Custom Helm Values

################################################
### CONFIG | Scheduler
################################################
# Airflow scheduler settings
scheduler:
  extraInitContainers:
    - name: fix-volume-logs-permissions
      image: busybox
      command: [ "sh", "-c", "chown -R 50000:0 /opt/airflow/logs/" ]
      securityContext:
        runAsUser: 0
      volumeMounts:
        - mountPath: /opt/airflow/logs/
          name: logs

#################
### CONFIG | Logs
#################
logs:
  persistence:
    enabled: true
    size: 80Gi
    annotations: {}
    storageClassName: 
    existingClaim: dna-airflow-logs

#####################################
### CONFIG | Logs PVC (AWS EKS + EBS CSI)
#####################################
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    meta.helm.sh/release-name: dna-airflow
    meta.helm.sh/release-namespace: airflow
    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
    volume.kubernetes.io/selected-node: ip-172-18-231-89.us-west-2.compute.internal
    volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
  creationTimestamp: "2024-04-10T20:51:02Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/managed-by: Helm
    chart: airflow-1.13.1
    component: logs-pvc
    heritage: Helm
    release: dna-airflow
    tier: airflow
  name: dna-airflow-logs
  namespace: airflow
  resourceVersion: "123093957"
  uid: 4d24054d-5e6a-49bf-ba8c-79f9ea9298a3
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: gp3
  volumeMode: Filesystem

This issue relates to the official Helm chart, so I am closing it here and have opened it on the official chart's issue tracker.