kubernetes-sigs/aws-fsx-csi-driver

FSX CSI driver stuck in error: code = Aborted desc = An operation with the given volume="***" is already in progress

AndriiChalenko-ait opened this issue · 1 comments

/kind bug

What happened?
I created a PVC. After it had been mounted in one pod, it could not be mounted in another pod. The second pod was stuck in a pending state with the error:
rpc error: code = Aborted desc = An operation with the given volume="fs-***" is already in progress

Both pods ran on the same node. We have seen this bug several times before.

What you expected to happen?
The PVC should mount in the second pod so that processing can continue.

How to reproduce it (as minimally and precisely as possible)?

  • Create PVC
  • Mount PVC to pod
  • Remove pod
  • Start a new pod on the same Node with the same PVC
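
The steps above can be sketched as two Pod manifests that share one claim and are pinned to one node. This is a minimal sketch, not the actual workload from this report: the claim name `fsx-claim`, the node name `NODE_NAME`, the pod names, the image, and the mount path are all hypothetical placeholders.

```python
import json


def pod_manifest(name, claim_name, node_name):
    """Build a Pod manifest that mounts claim_name and is pinned to node_name."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pinning both pods to one node mirrors the report.
            "nodeName": node_name,
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "app",
                    "image": "busybox",  # placeholder image
                    "command": ["sh", "-c", "ls /data && sleep 3600"],
                    "volumeMounts": [{"name": "fsx", "mountPath": "/data"}],
                }
            ],
            "volumes": [
                {
                    "name": "fsx",
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
        },
    }


# Hypothetical names; replace them with values from your cluster.
first = pod_manifest("fsx-repro-1", "fsx-claim", "NODE_NAME")
second = pod_manifest("fsx-repro-2", "fsx-claim", "NODE_NAME")

print(json.dumps(first, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped to `kubectl apply -f -`. Apply the first pod, delete it, then apply the second one.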

Anything else we need to know?:
We analyzed the logs on the node where the issue occurred and found an event for unmounting this PVC:
Dec 27 10:38:07 ip-10-191-12-229 kubelet: E1227 10:38:07.478116 9959 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fsx.csi.aws.com^fs-***podName:aa385be7-45a0-46c2-9372-1c7beecea903 nodeName:}" failed. No retries permitted until 2023-12-27 10:38:07.978081369 +0000 UTC m=+5328.214809925 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "pvc-18"(UniqueName: "kubernetes.io/csi/fsx.csi.aws.com^fs-089ee63f44cbcd997") pod "aa385be7-45a0-46c2-9372-1c7beecea903" (UID: "aa385be7-45a0-46c2-9372-1c7beecea903") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
The next event carries the "already in progress" message:
Dec 27 10:38:07 ip-10-191-12-229 kubelet: E1227 10:38:07.512125 9959 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fsx.csi.aws.com^fs-*** podName: nodeName:}" failed. No retries permitted until 2023-12-27 10:38:08.01210824 +0000 UTC m=+5328.248836618 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "pvc-5f4e4821-a888-4e62-8228-4f12ac700de9" (UniqueName: "kubernetes.io/csi/fsx.csi.aws.com^fs-***") pod "mergebam-403-20231227-1037-subjectdd1dnd11d12d13d140-qtd5q" (UID: "e6a8f6fc-8f03-41c9-b3aa-f5a1d6e61f63") : rpc error: code = Aborted desc = An operation with the given volume="fs-***" is already in progress
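
One plausible reading of these two log lines is a per-volume "in-flight" guard: while the unmount for a volume has not returned, any other operation on the same volume is rejected with Aborted. The sketch below illustrates that general pattern only. It is not the driver's actual code, and `InFlight`, `Aborted`, `node_operation`, and the volume ID `fs-example` are hypothetical names chosen for illustration.

```python
import threading


class Aborted(Exception):
    """Stand-in for a gRPC Aborted status."""


class InFlight:
    """Tracks volume IDs that currently have an operation running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._volumes = set()

    def insert(self, volume_id):
        """Return True if the volume was free and is now claimed."""
        with self._lock:
            if volume_id in self._volumes:
                return False
            self._volumes.add(volume_id)
            return True

    def delete(self, volume_id):
        with self._lock:
            self._volumes.discard(volume_id)


def node_operation(in_flight, volume_id, work):
    """Run work() unless another operation already holds volume_id."""
    if not in_flight.insert(volume_id):
        raise Aborted(
            f'An operation with the given volume="{volume_id}" is already in progress'
        )
    try:
        return work()
    finally:
        # The entry is released only when work() returns or raises.
        in_flight.delete(volume_id)


tracker = InFlight()
started = threading.Event()
release = threading.Event()


def hung_unmount():
    started.set()
    release.wait()  # simulates an umount call that does not return
    return "unmounted"


unmount_thread = threading.Thread(
    target=node_operation, args=(tracker, "fs-example", hung_unmount)
)
unmount_thread.start()
started.wait()  # the unmount now holds the in-flight entry

try:
    outcome = node_operation(tracker, "fs-example", lambda: "mounted")
except Aborted as exc:
    outcome = str(exc)

release.set()  # let the simulated unmount finish
unmount_thread.join()
after_release = node_operation(tracker, "fs-example", lambda: "mounted")
```

In this sketch the mount is rejected for as long as the unmount is blocked, and it succeeds once the unmount returns. A caller that gives up waiting, as in the DeadlineExceeded line above, does not by itself release the entry.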
The umount process for the PVC on that node:
image

The aws-fsx-csi-dri process on that node:
image

Environment

  • Kubernetes version (use kubectl version): v1.28.4-eks-8cb36c9
  • Driver version: v1.0.0

Hi @AndriiChalenko-ait, this is related to a known bug in v1.0.0 that was fixed in v1.1.0: #360. Upgrading the CSI driver image should resolve this issue. Please feel free to reopen if it doesn't.
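
To check whether a cluster still runs the affected release, the driver image tag can be compared against v1.1.0. This is a minimal sketch that assumes the tag has the plain form `vMAJOR.MINOR.PATCH`. The image strings used with it are hypothetical examples, and `parse_tag` and `needs_upgrade` are illustrative helpers, not part of any tool.

```python
import re

# First release that contains the fix, per the comment above.
FIXED_IN = (1, 1, 0)


def parse_tag(image):
    """Extract (major, minor, patch) from an image reference such as 'repo/name:v1.0.0'."""
    tag = image.rsplit(":", 1)[-1]
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(image):
    """Return True if the image tag is older than the fixed release."""
    return parse_tag(image) < FIXED_IN
```

For example, `needs_upgrade("example.registry/aws-fsx-csi-driver:v1.0.0")` returns `True`, while the same call with a `v1.1.0` tag returns `False`. Tags that do not match the assumed form raise `ValueError` instead of being guessed at.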