GoogleCloudPlatform/gcs-fuse-csi-driver

Sidecar mounter grace period

jadenlemmon opened this issue · 4 comments

First let me say, I'm excited that GKE has native support for this now. I'm excited to use this driver in projects.

I have a question regarding the grace period timeout that elapses before the mounter sidecar exits. It seems to wait 30 seconds before exiting. I'm using this inside short-term workloads, and this causes them all to run for an extra 30 seconds.

Looking at the code, it seems this isn't currently overridable. I see there is a gracePeriod flag that could be overridden here; however, the webhook doesn't allow passing in overrides.

Is there a reason for not allowing an override there? Would this be something a PR could be accepted for?

I have deployed this driver natively in GKE by using the addons approach.

Hi @jadenlemmon , thanks for the question.

This is about how the CSI driver handles the grace period. There are two scenarios.

First scenario

If the Pod is a Job Pod or the RestartPolicy is Never, the workflow is like the following:

  1. Once all the workload containers in the Pod besides the sidecar container have terminated, the CSI driver puts an "exit file" into the sidecar container's emptyDir volume to ask it to terminate.
  2. When the sidecar container detects the "exit file" in its emptyDir volume, it sleeps for 30 seconds. This duration is currently hard-coded and not configurable.
  3. After the sleep, the sidecar container sends a SIGTERM signal to all the gcsfuse processes to terminate them.
  4. Lastly, the sidecar container terminates, moving the Pod to a terminated status.

As you can see, in this scenario the driver does not respect the terminationGracePeriod. This logic is actually a workaround until the Kubernetes native sidecar container feature is available, which will handle sidecar container auto-termination.

Second scenario

For other workloads, the Pods are expected to run indefinitely. In this case, when the Pod is terminated, the flow follows the doc Kubernetes best practices: terminating with grace. Specifically,

  1. When the Pod is terminated, a SIGTERM signal is sent to its containers.
  2. The sidecar container captures the SIGTERM signal, and sleeps for the terminationGracePeriod.
  3. After the terminationGracePeriod has passed, a SIGKILL signal is sent to the Pod.
  4. All remaining containers are forcefully killed.

Going back to your question: yes, for the first scenario, we currently do not allow users to override the 30 sec gracePeriod. One reason is that we are waiting for the Kubernetes sidecar container feature to properly handle this.

But I think it's fair to allow users to override the gracePeriod even without the sidecar container feature support. Let me think about how to support this case and keep you updated.

@songjiaxun That all makes sense, but it would definitely be nice to support the override in the meantime. Let me know if I can help in any way.

@jadenlemmon I created this commit 65d5eca to make the sidecar container respect the Pod terminationGracePeriod in the first scenario.

In your use case, you will need to specify a small terminationGracePeriodSeconds value on your Pod, e.g. 5 or even 0; otherwise, the default value is 30 seconds. I hope this improvement will be helpful.
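For a short-lived Job-style Pod, the spec could look like the sketch below. The Pod name, image, command, and bucket name are placeholders; only terminationGracePeriodSeconds, the gke-gcsfuse/volumes annotation, and the gcsfuse.csi.storage.gke.io driver name come from this driver's usage.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: short-lived-pod          # placeholder name
  annotations:
    gke-gcsfuse/volumes: "true"  # asks the webhook to inject the sidecar
spec:
  terminationGracePeriodSeconds: 5  # sidecar now respects this value
  restartPolicy: Never
  containers:
  - name: workload
    image: busybox               # placeholder image
    command: ["sh", "-c", "ls /data"]
    volumeMounts:
    - name: gcs-volume
      mountPath: /data
  volumes:
  - name: gcs-volume
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-bucket    # placeholder bucket
```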

This will be included in the next release.

In the latest release, the gracePeriod flag was removed from the sidecar container, so the sidecar container exits immediately after the main workload container exits. According to the GCSFuse documentation:

Cloud Storage by nature is strongly consistent. Cloud Storage FUSE offers close-to-open and fsync-to-open consistency. Once a file is closed, consistency is guaranteed in the following open and read immediately.

As a result, it is unnecessary for the sidecar container to wait for the terminationGracePeriod. When the workload container exits, we can assume that all the data has been flushed to the bucket.