GoogleCloudPlatform/gcs-fuse-csi-driver

Long-running pods time out accessing cloud storage volume: OSError: [Errno 107] Transport endpoint is not connected

brokenjacobs opened this issue · 4 comments

I have a use case in which the gcs-fuse-csi-driver stores logs from Airflow workers. If a pod runs for 24 hours or so, over time it starts erroring on mkdir calls to the volume:

OSError: [Errno 107] Transport endpoint is not connected: '/opt/airflow/logbucket/dag_id=xxx/run_id=scheduled__2023-07-31T00:00:00+00:00/task_id=begin_drop_day_driving_table'

and

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/pathlib.py", line 1116, in mkdir
    os.mkdir(self, mode)
OSError: [Errno 107] Transport endpoint is not connected: '/opt/airflow/logbucket/dag_id=xxx/run_id=scheduled__2023-07-31T00:00:00+00:00/task_id=begin_drop_day_driving_table'

This continues until my pod eventually fails and restarts. I looked for clues to this behavior, but all I could find were a few reports of gcsfuse having similar issues after running for a long time. Resource (CPU/RAM) consumption from the pod is minimal. Any ideas what could be causing this? It is happening nightly.

I have the volume mounted using a CSI ephemeral volume if that helps.
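
For reference, the volume definition looks roughly like this (pod name, image, service account, and bucket name are simplified placeholders, not my exact manifest):

apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker                    # placeholder
  annotations:
    gke-gcsfuse/volumes: "true"           # tells the CSI driver to inject the gcsfuse sidecar
spec:
  serviceAccountName: airflow-worker      # KSA with access to the bucket (Workload Identity) in our setup
  containers:
  - name: worker
    image: apache/airflow:2.6.3           # placeholder image/tag
    volumeMounts:
    - name: gcs-fuse-logs
      mountPath: /opt/airflow/logbucket
  volumes:
  - name: gcs-fuse-logs
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-airflow-log-bucket # placeholder bucket name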

Hi @brokenjacobs, thanks for reporting the issue. "Transport endpoint is not connected" means the gcsfuse process terminated. In most of the cases we have seen, it was due to the sidecar being OOM-killed.
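
If the sidecar was OOM-killed, it usually shows up in the Pod's container statuses. A trimmed, illustrative fragment of kubectl get pod <pod-name> -o yaml would look roughly like this (the injected sidecar container is normally named gke-gcsfuse-sidecar):

status:
  containerStatuses:
  - name: gke-gcsfuse-sidecar
    lastState:
      terminated:
        exitCode: 137        # 137 = killed by SIGKILL, typical of an OOM kill
        reason: OOMKilled    # the kubelet records the OOM kill here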

Could you share the following information?

  1. How much memory did you allocate to the gcsfuse sidecar container via Pod annotation? Or are you using the default value (256Mi)?
  2. As you are using a GCS bucket to store logs, are the logs being appended to the same file? Or are you periodically uploading different log files to the bucket?

1. Yes, we are using the default memory value.
2. These logs are not appended; Airflow copies the log to the bucket directory after the job completes. There are no further writes.

Typical log sizes in this cluster are around 4 KiB, with up to 8 being written at once. It doesn't seem possible that this is RAM-related in the sidecar, and I'm not seeing large memory usage in the sidecar containers at all:

[screenshot: sidecar container memory usage metrics]

I went back through the metrics and found one sidecar hitting 200-204 MiB of RAM usage but not getting killed (the metrics continue). I'll try raising the value (it can't hurt) and see if it removes the issue.
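
For anyone else who hits this, the change is just the sidecar memory annotation on the Pod; 512Mi below is the value I picked to test with, not an official recommendation:

metadata:
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/memory-limit: "512Mi"   # sidecar memory limit; the default is 256Mi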

Looks like this resolved my issue; no errors over the weekend. Thanks!