Astra Trident "failed to create cloned volume" - but creates it in Amazon FSx
RuairiSinkler opened this issue · 2 comments
Describe the bug
This issue is intermittent, though it happens more often than not.
When trying to provision an Amazon FSx FlexGroup volume as a clone from a snapshot, Trident reports failure (and keeps failing indefinitely) despite successfully creating the volume in Amazon FSx.
See (anonymised and shortened) Trident-main logs here:
{"@timestamp":"2023-11-15T08:54:47Z","crdControllerEvent":"add","driver":"ontap-nas-flexgroup","error":"API status: failed, Reason: Insufficient privileges: user '<USER_NAME>' does not have write access to this resource, Code: 13003","level":"error","logLayer":"core","message":"Error logging EMS message.","requestID":"7ca9ab3a-545f-4909-8ff2-c1f16effe381","requestSource":"CRD","workflow":"cr=reconcile"}
...
{"@timestamp":"2023-11-15T08:55:13Z","level":"info","logLayer":"core","message":"Snapshot created.","requestID":"1e0d8a35-4d84-4c11-89c5-e7039464fa64","requestSource":"CSI","snapshotName":"snapshot-794022c3-b28b-4d06-9712-4d8cbe3dd391","volumeName":"<VOLUME_NAME>","workflow":"snapshot=create"}
...
{"@timestamp":"2023-11-15T08:56:20Z","level":"warning","logLayer":"core","message":"Job not completed after 120 seconds.","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:56:20Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"waiting for async response failed","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:56:20Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: waiting for async response failed","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI"}
...
{"@timestamp":"2023-11-15T08:57:32Z","level":"warning","logLayer":"core","message":"Job not completed after 120 seconds.","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:32Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"waiting for async response failed","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:32Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: waiting for async response failed","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI"}
...
{"@timestamp":"2023-11-15T08:57:48Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"abf74801-1725-46a2-9b11-19616b4159cb","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:48Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"abf74801-1725-46a2-9b11-19616b4159cb","requestSource":"CSI"}
{"@timestamp":"2023-11-15T08:58:18Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"92fc6cdc-e886-4df7-8566-c467d553df75","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:58:18Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"92fc6cdc-e886-4df7-8566-c467d553df75","requestSource":"CSI"}
{"@timestamp":"2023-11-15T08:58:48Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"2d709a11-c424-4112-b9c1-8022f7f22ab3","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:58:48Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"2d709a11-c424-4112-b9c1-8022f7f22ab3","requestSource":"CSI"}
What appears to be happening is that Trident requests the clone, times out while waiting for it to finish, and then requests it again. The first request then succeeds, and the second fails because the volume already exists.
This can also be seen in the Amazon FSx job history:
11/15 08:58:35  <FSX_ID>  <SVM>  Clone Create  Failed     1453
  Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
  Completion: Failed to create the FlexGroup clone volume "<VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b" in Vserver "<SVM>". Reason: VLDB: volume record not created, duplicate key. Cleanup of the partially created FlexGroup clone volumes failed. Contact technical support for assistance.
11/15 08:57:11  <FSX_ID>  <SVM>  Clone Create  Succeeded  1452
  Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
  Completion: Successful
11/15 08:56:28  <FSX_ID>  <SVM>  Clone Create  Running    1453
  Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
11/15 08:55:17  <FSX_ID>  <SVM>  Clone Create  Running    1452
  Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
Note how job 1452 is interleaved with the beginning of job 1453 - both requested by Trident. 1452 then succeeds, and 1453 (the one Trident is presumably now tracking) fails due to a duplicate key.
The only way out of this situation appears to be manually deleting all the affected resources on the cluster (editing out finalizers, etc.) as well as the created resources in Amazon FSx.
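For illustration, a rough sketch of that manual cleanup (the resource names, the `trident` namespace, and the FSx-side step below are placeholders based on our setup; verify against your own cluster before deleting anything):

```shell
# Rough sketch of the manual cleanup; names and namespace are placeholders.

# Strip the finalizers from the stuck Trident volume CR so Kubernetes can
# actually delete it (Trident CR names usually match the PV name).
kubectl patch tridentvolumes.trident.netapp.io pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b \
  -n trident --type=merge -p '{"metadata":{"finalizers":null}}'

# Delete the stuck PVC and any PV that was created for it.
kubectl delete pvc cloned-pvc -n my-namespace
kubectl delete pv pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b

# Finally, remove the orphaned FlexGroup clone on the FSx side (e.g. via the
# ONTAP CLI), since Trident no longer tracks it.
```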
Environment
- Trident version: 23.10.0
- Trident installation flags used: N/A
- Container runtime: containerd://1.6.19
- Kubernetes version: 1.26.10
- Kubernetes orchestrator: EKS
- Kubernetes enabled feature gates: N/A
- OS: linux
- NetApp backend types: Amazon FSx (ontap-nas-flexgroup)
- Other: Custom AmazonFSx user/role used with minimal required permissions granted, rather than Admin role
To Reproduce
Steps to reproduce the behaviour:
The following is completed automatically by a custom operator:
1. Create a VolumeSnapshot with a manually imported volume's PVC as the source.
2. Wait for Trident to provision the snapshot and mark it "ready".
3. Create a PVC with the snapshot as its source to trigger a clone.
4. Observe the PVC remaining pending forever, with the above logs and state on the NetApp backend.
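For reference, the two resources the operator creates look roughly like the manifests below (all names and storage/snapshot class names are placeholders, not the actual ones used):

```yaml
# Hypothetical manifests illustrating the repro steps; names and class
# names are placeholders.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: source-snapshot
spec:
  volumeSnapshotClassName: trident-snapshot-class
  source:
    persistentVolumeClaimName: imported-pvc   # the manually imported volume's PVC
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  storageClassName: trident-flexgroup
  dataSource:
    name: source-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```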
Expected behaviour
The clone should be created and bound to the PVC as normal.
Additional context
This problem does not occur every time, but when cloning fails it usually fails in this way. It was originally occurring on 23.01.0, and is still happening on the latest 23.10.0 after upgrading.