Random "Unable to attach or mount volumes"
kanor1306 opened this issue · 31 comments
/kind bug
What happened?
Pods get stuck in ContainerCreating and nodes are rendered useless, seemingly at random.
We have been seeing this issue occasionally for a couple of months (possibly longer, since we started using Lustre, but it only seems to appear when there is a lot going on). It is quite disruptive in production systems.
The timeline of the issue is:
- A new pod, podX, gets created; it mounts multiple Lustre volumes, both static and dynamically provisioned.
- podX is scheduled to a node that has been working fine.
- podX gets stuck in ContainerCreating status. The events in the pod are these (Lustre volume ID and IPs redacted):
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 42m (x94 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23714 vol-23715 workflow-token-csshf vol-23713]: timed out waiting for the condition
Warning FailedMount 17m (x105 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23715 workflow-token-csshf vol-23713 vol-23714]: timed out waiting for the condition
Warning FailedMount 13m (x71 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal MountVolume.SetUp failed for volume "pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e" : rpc error: code = Internal desc = Could not mount "<lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv" at "/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount": mount failed: exit status 17
Mounting command: mount
Mounting arguments: -t lustre -o flock <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount
Output: mount.lustre: mount <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv at /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount failed: File exists
Warning FailedMount 8m1s (x85 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[workflow-token-csshf vol-23713 vol-23714 vol-23715]: timed out waiting for the condition
Warning FailedMount 77s (x233 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23713 vol-23714 vol-23715 workflow-token-csshf]: timed out waiting for the condition
Things to note:
- Once this happens on a node, multiple pods are affected, usually with similar creation times.
- The same issue happens on other nodes, with different Lustre volumes involved (at different times).
- The same Lustre volume is mounted perfectly fine on other nodes.
- Not all Lustre volumes on the affected node have the issue; some are mounted properly.
- The node does not recover from this situation; the affected Lustre volumes can no longer be mounted by any pod on the node.
What you expected to happen?
I expect that at some point the pods stuck in ContainerCreating are able to mount the volume, or else to have a way to identify that a node is not healthy "Lustre-wise".
How to reproduce it (as minimally and precisely as possible)?
I don't have a reproduction recipe; it just happens randomly. There is #213, which has a similar error message, but I am not sure it is the same issue, as in my case it does not seem reproducible.
Anything else we need to know?:
For context, here are the manifests of the StorageClass and the PVC, although as I said it happens with both static and dynamically provisioned Lustre volumes. Also attached are the logs of the driver and the kubelet at the time of the issue: fsx-driver-kubelet.csv
StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-scratch-sc
mountOptions:
  - flock
parameters:
  deploymentType: SCRATCH_2
  securityGroupIds: sg-0b7e22c1b83b54b18
  subnetId: subnet-0b2314c14de6a3ea8
provisioner: fsx.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
  labels:
    app-name: executor
  name: executor-483c6798-pvc
  namespace: workflows
  resourceVersion: "444522977"
  uid: abf3d9ca-ab58-4d82-a2a9-d79768cd038e
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 120000Gi
  storageClassName: fsx-scratch-sc
  volumeMode: Filesystem
  volumeName: pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 120000Gi
  phase: Bound
Environment
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: v0.8.1, installed via Helm chart v1.4.1
- Sidecars of the driver:
sidecars:
  livenessProbe:
    image:
      tag: v2.3.0-eks-1-20-11
  nodeDriverRegistrar:
    image:
      tag: v2.2.0-eks-1-20-11
  provisioner:
    image:
      tag: v2.2.2-eks-1-20-11
  resizer:
    image:
      tag: v1.2.0-eks-1-20-11
#246 seems to address this, at least partially. Will update to 0.8.2 and see if this problem stops happening.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle-stale
We haven't seen it happening again, so I will close this and reopen if it reoccurs.
Well, this was very unlucky. I closed the issue 3 days ago, and here we are: it has just happened again. Reopening; I will add more details in another message.
This time, for 4 hours, a node had problems mounting one particular Lustre volume that was being used without issue on other nodes. The situation is the same as described in the original issue: pods stuck in Init state due to problems mounting a volume. This time the error in the events of the affected pods was:
Mounting command: mount
Mounting arguments: -t lustre -o flock fs-04d34cd6cccadfcfc.fsx.eu-west-1.amazonaws.com@tcp:/7levdbmv /var/lib/kubelet/pods/012b41da-50ab-49f1-9a04-43709198d27b/volumes/kubernetes.io~csi/pv-workflow-service-mnt-cache/mount
Output: mount.lustre: mount fs-04d34cd6cccadfcfc.fsx.eu-west-1.amazonaws.com@tcp:/7levdbmv at /var/lib/kubelet/pods/012b41da-50ab-49f1-9a04-43709198d27b/volumes/kubernetes.io~csi/pv-workflow-service-mnt-cache/mount failed: File exists
Essentially the same as before.
Could someone with expertise in this driver comment on whether this mounting problem should be brought up in the main Kubernetes repository instead of this driver? It is not clear to me whether this is a general Kubernetes problem or one specific to the FSx driver.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@jacobwolfaws: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
/remove-lifecycle rotten
/reopen
/remove-lifecycle rotten
@olemarkus: Reopened this issue.
In response to this:
/reopen
/remove-lifecycle rotten
Hi @kanor1306, are you still having this issue?
Hi @jacobwolfaws, I still get the error message that I mentioned in #244 (comment) from time to time (around 20 times in the last week, in a cluster with a lot of scratch volume rotation).
That said, as I have implemented workarounds with retries and cordoning of the nodes affected by this issue, I can more or less prevent the problems it was causing. But next week I will dig into this again and tell you for sure whether I still get pods stuck initializing because of it.
@kanor1306 sounds good, let me know!
@jacobwolfaws I haven't had the time to look into this seriously, and a first search of the logs does not turn up any hits for the mount failure, so you can close it if you want. I can reopen it if I ever see it happening again.
@kanor1306 Sounds good, it's possible one of our recent images corrected your issue. Closing
/close
@jacobwolfaws: Closing this issue.
In response to this:
/close
/reopen
@jacobwolfaws: Reopened this issue.
In response to this:
/reopen
Seems to be related: #180
I think this issue merits further exploration.
That said, as I have implemented workarounds with retries and cordoning nodes affected by this issue, I can more or less prevent the problems that this was causing.
This gives users a short-term potential mitigation, but I'd prefer to figure out the cause and deliver a fix
I would love to hear more about the mitigations you put in place, @kanor1306, if you are up for sharing! It seems that we are hitting this or something very similar as well.
@colearendt it is a pretty primitive solution that may not be suited to every case. My use case is a lot of fault-tolerant, retryable jobs in a cluster with autoscaling enabled and a high rate of node rotation.
For our case, what seems to reduce the impact of this issue is a CronJob that runs every few minutes and checks for pods that have been in PodInitializing or ContainerCreating for more than x minutes (x will of course depend on your workloads and how long it is acceptable for them to take before being considered problematic). The job then compiles a list of nodes where these problematic pods are running. The idea is that those nodes are not really healthy, as they are having issues mounting Lustre volumes, so the CronJob cordons them and then deletes the pods stuck in those two states. The nodes will not receive new workloads due to the cordoning, and they will eventually go away once the current, unaffected workloads finish.
There is a second case, where pods get stuck in Terminating, which can have unexpected side effects and which I suspect is caused by the same underlying problem (but with less impact, as those workloads have already finished). In this case the CronJob compares the deletion timestamp with the current time, and if enough time has passed and the pod is still Terminating, it gets force deleted. But this is only acceptable if your nodes are not very long-lived; otherwise you risk unknowingly leaving things running.
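To make this more concrete, here is a rough sketch of the core CronJob logic in Go with client-go; the 15-minute threshold, the in-cluster config, and the plain (non-forced) delete are illustrative assumptions rather than exactly what we run:

// stuckpodreaper: sketch of the CronJob described above. It lists pods stuck in
// ContainerCreating/PodInitializing, cordons the nodes they sit on, and deletes
// the stuck pods so their controllers can reschedule them elsewhere.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// stuckAfter is the "x minutes" threshold mentioned above; tune it per workload.
const stuckAfter = 15 * time.Minute

// isStuckCreating reports whether a pending pod has been waiting in
// ContainerCreating or PodInitializing for longer than stuckAfter.
func isStuckCreating(pod corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodPending || time.Since(pod.CreationTimestamp.Time) < stuckAfter {
		return false
	}
	statuses := append(pod.Status.InitContainerStatuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil &&
			(w.Reason == "ContainerCreating" || w.Reason == "PodInitializing") {
			return true
		}
	}
	return false
}

func main() {
	ctx := context.Background()
	cfg, err := rest.InClusterConfig() // runs in-cluster as a CronJob with suitable RBAC
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Nodes hosting stuck pods are treated as unhealthy "Lustre-wise".
	badNodes := map[string]bool{}
	for _, pod := range pods.Items {
		if isStuckCreating(pod) && pod.Spec.NodeName != "" {
			badNodes[pod.Spec.NodeName] = true
		}
	}

	// Cordon the suspect nodes so no new workloads land on them.
	cordon := []byte(`{"spec":{"unschedulable":true}}`)
	for node := range badNodes {
		if _, err := client.CoreV1().Nodes().Patch(ctx, node,
			types.StrategicMergePatchType, cordon, metav1.PatchOptions{}); err != nil {
			fmt.Println("failed to cordon node", node, ":", err)
		}
	}

	// Delete the stuck pods; the cordon keeps their replacements off the bad nodes.
	for _, pod := range pods.Items {
		if isStuckCreating(pod) && badNodes[pod.Spec.NodeName] {
			if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				fmt.Println("failed to delete pod", pod.Namespace+"/"+pod.Name, ":", err)
			}
		}
	}
}

The real job additionally needs RBAC to list and delete pods and to patch nodes, and the Terminating case above would also pass a zero grace period to the delete.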
So, as you can see, this is no magic bullet, just a sad workaround for these weird occurrences. It has worked for us, but only because our workloads tolerate this kind of solution (although it can add some delay while waiting to consider a pod or node faulty).
Hope that helps, or at least gives you some ideas.
Thanks @kanor1306! That is super helpful! Unfortunately, it is about as painful a workaround as I expected, haha. Our workloads are stateful and interactive (i.e. the user is often coupled to a pod via websockets) with not-so-great retry behavior, which makes this a pretty pernicious issue for us.
I'm glad to hear we are not the only ones with funky terminating behavior too. I left that out of my last message because I thought it might be something particular to our EKS cluster, but we never experienced this sort of thing until we added the FSx CSI driver. Our messages around cleanup/termination are usually something along the lines of:
May 11 02:29:51 ip-172-19-104-36 kubelet: time="2023-05-11T02:29:51Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/systemd/kubepods/besteffort/podccf3e475-3d22-47c4-88be-3af8a8436257/25ca0331e442fb40f81b0fd64e8e22e4082c1a909f7e8a461c0f919831cca8da: device or resource busy"
Moreover, any attempts I have made at the host level to remedy the situation have failed as well.
It looks like we are going to need to mature our operations a bit to work around this issue. @jacobwolfaws A fix here would be super helpful for us! The available workarounds seem like a bit of a stretch IMO.
@kanor1306 are you still seeing this error message:
Output: mount.lustre: mount fs-04d34cd6cccadfcfc.fsx.eu-west-1.amazonaws.com@tcp:/7levdbmv at /var/lib/kubelet/pods/012b41da-50ab-49f1-9a04-43709198d27b/volumes/kubernetes.io~csi/pv-workflow-service-mnt-cache/mount failed: File exists
This suggests an idempotency issue when mounting. We should have a check for that here: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blame/master/pkg/driver/node.go#L104, but if that is not working for you on the latest CSI driver, we'll need to revisit it.
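For illustration, the general shape of that kind of idempotency check with k8s.io/mount-utils is roughly the following. This is a sketch of the pattern, not the driver's actual code, and the source/target values in main are placeholders modeled on the redacted events above:

// mountcheck.go: sketch of an idempotent publish using k8s.io/mount-utils. If the
// target is already a mount point, the call returns success instead of running a
// second mount, which can surface as the "File exists" error seen in the events above.
package main

import (
	"fmt"
	"os"

	mountutils "k8s.io/mount-utils"
)

// ensureMounted mounts source at target only if target is not already a mount point.
func ensureMounted(mounter mountutils.Interface, source, target string, opts []string) error {
	notMnt, err := mounter.IsLikelyNotMountPoint(target)
	if err != nil {
		if !os.IsNotExist(err) {
			return err
		}
		// The target directory does not exist yet: create it, then mount.
		if err := os.MkdirAll(target, 0750); err != nil {
			return err
		}
		notMnt = true
	}
	if !notMnt {
		// Already mounted: treat the request as already satisfied.
		return nil
	}
	if err := mounter.Mount(source, target, "lustre", opts); err != nil {
		return fmt.Errorf("could not mount %q at %q: %w", source, target, err)
	}
	return nil
}

func main() {
	// Placeholder values; a real node service derives these from the CSI request.
	mounter := mountutils.New("")
	err := ensureMounted(mounter,
		"<lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv",
		"/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount",
		[]string{"flock"})
	fmt.Println("mount result:", err)
}

If a check like this reports the target as not mounted but the actual mount still fails with "File exists", the existing entry in the mount table may be stale or corrupted, which would be worth ruling out on an affected node.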
The terminating issue could be separate or related; if you have logs from that event that you're comfortable sharing, they might be helpful.
@colearendt Please cut a support case to the AWS team and share the case ID with me over email at jacwolf@amazon.com . I want to collect some additional logs to root cause the issues you're seeing.
Hi @jacobwolfaws, I no longer have access to the system that was having these issues, but I don't recall seeing them anymore while I still had access. My last comment about the issues and the workaround was based on the issues from the original post, but the workaround was also hiding the errors, so it is difficult to say.
@kanor1306 Ah, I see. I'm going to close this issue since it is no longer an issue for you. @colearendt, please feel free to open a new issue with logs/information about the specific issue you're seeing.
/close
@jacobwolfaws: Closing this issue.
In response to this:
/close