GoogleCloudPlatform/gcs-fuse-csi-driver

rpc error: code = Internal desc = the sidecar container terminated due to ContainerStatusUnknown

franciscocpg opened this issue · 7 comments

I'm using this driver to mount a GCS bucket at a local folder in the pod, and I'm seeing the error below:

kubelet  MountVolume.SetUp failed for volume "XXX" : rpc error: code = Internal desc = the sidecar container terminated due to ContainerStatusUnknown

I'm running a GKE cluster and this issue happens only on pods running on preemptible nodes that are restarted due to node replacement.

Hi @franciscocpg , thanks for reporting this issue.

This is actually expected since workloads on preemptible nodes can be interrupted. However, we can improve the error code and the documentation.

I have a question: Is your workload affected due to this error? If so, could you share more information?

Hi @songjiaxun.

I have a question: Is your workload affected due to this error? If so, could you share more information?

Yes, the pods get stuck in ContainerStatusUnknown, ContainerCreating, or Error status, and the desired number of running replicas is not met, e.g.:

$ kubectl get deploy my-deploy
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
my-deploy    4/6     6            4           6y176d

$ kubectl get po -l app=my-deploy
NAME                          READY   STATUS                   RESTARTS   AGE
my-deploy-859b7469ff-5bt8v    0/2     Error                    1          101m
my-deploy-859b7469ff-69jdr    0/2     Error                    1          96m
my-deploy-859b7469ff-rqdbq    0/2     ContainerStatusUnknown   2          102m
my-deploy-859b7469ff-tbg4h    0/2     Error                    1          100m
my-deploy-859b7469ff-tppbc    0/2     ContainerStatusUnknown   1          95m
my-deploy-859b7469ff-zq6gm    0/2     ContainerStatusUnknown   1          95m
my-deploy-898c96857-5j69g     0/2     ContainerCreating        2          90m
my-deploy-898c96857-72g5m     2/2     Running                  0          90m
my-deploy-898c96857-bdff8     2/2     Running                  0          89m
my-deploy-898c96857-c7cvm     2/2     Running                  0          89m
my-deploy-898c96857-phggw     0/2     ContainerCreating        2          89m
my-deploy-898c96857-z6vhk     2/2     Running                  0          89m

So I need to manually delete the affected pods to get back to 6 available replicas.
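In case it helps anyone else, one way to do the cleanup in bulk (a sketch, assuming the same `app=my-deploy` label from the example above; pods stuck in ContainerStatusUnknown end up in the terminal Failed phase):

```shell
# Delete pods in the terminal Failed phase (covers ContainerStatusUnknown);
# the Deployment controller will recreate them to restore the replica count.
kubectl delete pods -l app=my-deploy --field-selector=status.phase=Failed
```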

@franciscocpg thanks for the info! Let me try to repro on my side and get you updated.

Thank you @songjiaxun.

My current workaround is to install gcsfuse in the Docker image that the pod runs, and to mount/unmount the GCS bucket on a local folder via lifecycle hooks, e.g.:

(...)
spec:
  template:
    spec:
      containers:
      - lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - mkdir -p /local/folder && gcsfuse --implicit-dirs -o allow_other my-bucket /local/folder
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - fusermount -u /local/folder
(...)
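Note that running gcsfuse inside the application container itself requires FUSE privileges. A sketch of what the container additionally needs in my setup (mounting FUSE generally requires CAP_SYS_ADMIN and access to /dev/fuse; privileged mode is the simplest way to get both, though it is broader than strictly necessary):

```yaml
(...)
spec:
  template:
    spec:
      containers:
      - securityContext:
          # Grants CAP_SYS_ADMIN and /dev/fuse access so gcsfuse can mount;
          # a narrower alternative is capabilities.add: ["SYS_ADMIN"] plus
          # exposing the /dev/fuse device to the container.
          privileged: true
(...)
```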

Hi @franciscocpg ,

We recently root-caused a similar issue: #113 (comment)

Could you confirm if you are also using 1.26 clusters, and the errors happen during node preemption?

If so, upgrading to 1.27 clusters may fix the issue.

Hi @songjiaxun.

We are using v1.25.12-gke.500 and cannot upgrade at the moment.

But I'll try again as soon as we can upgrade.

This issue should have been resolved in the current CSI driver. Closing for now. Please re-open if similar issues are seen.