rpc error: code = Internal desc = the sidecar container terminated due to ContainerStatusUnknown
franciscocpg opened this issue · 7 comments
I'm using this driver to mount a gcs bucket to a local folder in the pod and I'm seeing the error below:
kubelet MountVolume.SetUp failed for volume "XXX" : rpc error: code = Internal desc = the sidecar container terminated due to ContainerStatusUnknown
I'm running a GKE cluster and this issue happens only on pods running on preemptible nodes that are restarted due to node replacement.
Hi @franciscocpg , thanks for reporting this issue.
This is actually expected since workloads on preemptible nodes can be interrupted. However, we can improve the error code and the documentation.
I have a question: Is your workload affected due to this error? If so, could you share more information?
Hi @songjiaxun.
I have a question: Is your workload affected due to this error? If so, could you share more information?
Yes, the pods get stuck in ContainerStatusUnknown, ContainerCreating, or Error status, and the desired number of running replicas is not respected, e.g.:
$ kubectl get deploy my-deploy
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
my-deploy   4/6     6            4           6y176d
$ kubectl get po -l app=my-deploy
NAME                         READY   STATUS                   RESTARTS   AGE
my-deploy-859b7469ff-5bt8v   0/2     Error                    1          101m
my-deploy-859b7469ff-69jdr   0/2     Error                    1          96m
my-deploy-859b7469ff-rqdbq   0/2     ContainerStatusUnknown   2          102m
my-deploy-859b7469ff-tbg4h   0/2     Error                    1          100m
my-deploy-859b7469ff-tppbc   0/2     ContainerStatusUnknown   1          95m
my-deploy-859b7469ff-zq6gm   0/2     ContainerStatusUnknown   1          95m
my-deploy-898c96857-5j69g    0/2     ContainerCreating        2          90m
my-deploy-898c96857-72g5m    2/2     Running                  0          90m
my-deploy-898c96857-bdff8    2/2     Running                  0          89m
my-deploy-898c96857-c7cvm    2/2     Running                  0          89m
my-deploy-898c96857-phggw    0/2     ContainerCreating        2          89m
my-deploy-898c96857-z6vhk    2/2     Running                  0          89m
So I need to manually delete the pods with issues to have the 6 replicas available again.
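A minimal sketch of how that manual cleanup could be scripted, assuming the same `app=my-deploy` label and that pods in Error or ContainerStatusUnknown are safe to delete. The `stuck_pods` helper is hypothetical and only filters the default `kubectl get po` table by its STATUS column:

```shell
# Hypothetical helper: read `kubectl get po` output on stdin and print the
# names of pods whose STATUS column is Error or ContainerStatusUnknown.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
stuck_pods() {
  awk 'NR > 1 && ($3 == "Error" || $3 == "ContainerStatusUnknown") { print $1 }'
}

# Usage (requires kubectl access to the cluster; -r is the GNU xargs flag
# that skips the command when there is no input):
#   kubectl get po -l app=my-deploy | stuck_pods | xargs -r kubectl delete po
```

ContainerCreating is left out of the filter on purpose, since a healthy pod passes through that state while starting; pods stuck there for a long time would still need to be deleted by hand.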
@franciscocpg thanks for the info! Let me try to repro on my side and get you updated.
Thank you @songjiaxun.
My current workaround is to install gcsfuse in the Docker image that the pod runs and to mount/unmount the GCS bucket on the local folder using lifecycle hooks, e.g.:
(...)
spec:
  template:
    spec:
      containers:
      - lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - mkdir -p /local/folder && gcsfuse --implicit-dirs -o allow_other my-bucket /local/folder
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - fusermount -u /local/folder
(...)
Hi @franciscocpg ,
We recently root-caused a similar issue: #113 (comment)
Could you confirm whether you are also using 1.26 clusters and whether the errors happen during node preemption?
If so, upgrading to a 1.27 cluster may fix the issue.
Hi @songjiaxun.
We are using v1.25.12-gke.500 and cannot upgrade at the moment.
But I'll try again as soon as we can upgrade.
This issue should have been resolved in the current CSI driver. Closing for now. Please re-open if similar issues are seen.