NetApp/trident

Astra Trident "failed to create cloned volume" - but creates it in Amazon FSx

RuairiSinkler opened this issue · 2 comments

Describe the bug
This issue is intermittent, though it happens more often than not.

When provisioning an Amazon FSx FlexGroup volume as a clone from a snapshot, Trident reports failure (and keeps failing indefinitely) despite having successfully created the volume in Amazon FSx.

See the (anonymised and shortened) trident-main logs below:

{"@timestamp":"2023-11-15T08:54:47Z","crdControllerEvent":"add","driver":"ontap-nas-flexgroup","error":"API status: failed, Reason: Insufficient privileges: user '<USER_NAME>' does not have write access to this resource, Code: 13003","level":"error","logLayer":"core","message":"Error logging EMS message.","requestID":"7ca9ab3a-545f-4909-8ff2-c1f16effe381","requestSource":"CRD","workflow":"cr=reconcile"}
...
{"@timestamp":"2023-11-15T08:55:13Z","level":"info","logLayer":"core","message":"Snapshot created.","requestID":"1e0d8a35-4d84-4c11-89c5-e7039464fa64","requestSource":"CSI","snapshotName":"snapshot-794022c3-b28b-4d06-9712-4d8cbe3dd391","volumeName":"<VOLUME_NAME>","workflow":"snapshot=create"}
...
{"@timestamp":"2023-11-15T08:56:20Z","level":"warning","logLayer":"core","message":"Job not completed after 120 seconds.","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:56:20Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"waiting for async response failed","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:56:20Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: waiting for async response failed","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI"}
...
{"@timestamp":"2023-11-15T08:57:32Z","level":"warning","logLayer":"core","message":"Job not completed after 120 seconds.","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:32Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"waiting for async response failed","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:32Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: waiting for async response failed","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI"}
...
{"@timestamp":"2023-11-15T08:57:48Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"abf74801-1725-46a2-9b11-19616b4159cb","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:48Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"abf74801-1725-46a2-9b11-19616b4159cb","requestSource":"CSI"}
{"@timestamp":"2023-11-15T08:58:18Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"92fc6cdc-e886-4df7-8566-c467d553df75","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:58:18Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"92fc6cdc-e886-4df7-8566-c467d553df75","requestSource":"CSI"}
{"@timestamp":"2023-11-15T08:58:48Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"2d709a11-c424-4112-b9c1-8022f7f22ab3","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:58:48Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"2d709a11-c424-4112-b9c1-8022f7f22ab3","requestSource":"CSI"}

What appears to be happening is that Trident requests the clone, times out after 120 seconds while waiting for the clone job to finish, and then requests it again. The first request then succeeds on the backend, and the retry fails because the "volume already exists".
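A quick way to confirm the mismatch is to compare what Trident has recorded against what actually exists on the backend. The commands below are a sketch: they assume Trident is installed in the default trident namespace, and the PVC name, application namespace and FSx management endpoint are placeholders.

# The clone PVC stays Pending and Trident has no TridentVolume CR for it:
kubectl get pvc <CLONE_PVC_NAME> -n <APP_NAMESPACE>
kubectl get tridentvolumes -n trident | grep d342bb73

# ...yet the FlexGroup clone does exist on the FSx file system, which is why
# every retry fails with "volume already exists" (checked via the ONTAP CLI
# over the FSx management endpoint, assuming fsxadmin credentials):
ssh fsxadmin@<FSX_MANAGEMENT_ENDPOINT> "volume show -vserver <SVM> -volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b"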

This can also be seen in the Amazon FSx job history:

11/15 08:58:35  <FSX_ID>  <SVM>  Clone Create  Failed     1453
     Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
      Completion: Failed to create the FlexGroup clone volume
                  "<VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b" in Vserver
                  "<SVM>". Reason: VLDB: volume record not created, duplicate key.
                  Cleanup of the partially created FlexGroup clone volumes failed.
                  Contact technical support for assistance.
11/15 08:57:11  <FSX_ID>  <SVM>  Clone Create  Succeeded  1452
     Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
      Completion: Successful
11/15 08:56:28  <FSX_ID>  <SVM>  Clone Create  Running    1453
     Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b
11/15 08:55:17  <FSX_ID>  <SVM>  Clone Create  Running    1452
     Description: Create <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b

Note how job 1452 is interleaved with the beginning of job 1453 - both requested by Trident. Job 1452 then succeeds, and job 1453 (the one Trident is presumably now tracking) fails due to a duplicate key.

The only way out of this situation appears to be manually deleting all of the affected resources on the cluster (editing out finalizers etc.) as well as the created resources in Amazon FSx.
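For reference, the manual cleanup looks roughly like the following. This is a sketch of what we do rather than an official procedure; all names, namespaces and the FSx management endpoint are placeholders, and the exact set of stuck resources can vary:

# Remove the stuck PVC on the Kubernetes side (it can hang on its finalizers):
kubectl patch pvc <CLONE_PVC_NAME> -n <APP_NAMESPACE> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <CLONE_PVC_NAME> -n <APP_NAMESPACE> --ignore-not-found

# Clean up any Trident CRs left behind for the failed clone (assuming the
# default "trident" namespace):
kubectl get tridentvolumes,tridentsnapshots -n trident | grep d342bb73
kubectl patch tridentvolume <TVOL_NAME> -n trident --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete tridentvolume <TVOL_NAME> -n trident

# Finally, delete the orphaned FlexGroup clone on the FSx file system via the
# ONTAP CLI (assuming fsxadmin credentials; run 'volume unmount' first if the
# clone was junctioned):
ssh fsxadmin@<FSX_MANAGEMENT_ENDPOINT> "volume offline -vserver <SVM> -volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b"
ssh fsxadmin@<FSX_MANAGEMENT_ENDPOINT> "volume delete -vserver <SVM> -volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b"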

Environment

  • Trident version: 23.10.0
  • Trident installation flags used: N/A
  • Container runtime: containerd://1.6.19
  • Kubernetes version: 1.26.10
  • Kubernetes orchestrator: EKS
  • Kubernetes enabled feature gates: N/A
  • OS: linux
  • NetApp backend types: Amazon FSx for NetApp ONTAP (ontap-nas-flexgroup)
  • Other: Custom Amazon FSx user/role with only the minimal required permissions granted, rather than the admin role

To Reproduce
Steps to reproduce the behaviour:

The following is completed automatically by a custom operator:

  • Create a VolumeSnapshot with a manually imported volume's PVC as the source
  • Wait for Trident to provision the snapshot and mark it as ready
  • Create a PVC with the snapshot as its source to trigger a clone (see the example manifests after this list)
  • Observe the PVC remaining Pending forever, along with the logs and NetApp backend state shown above
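Minimal manifests approximating what the operator creates are below. All names, namespaces, classes, sizes and access modes are placeholders for our real values, so treat this as a sketch of the workflow rather than an exact reproduction:

kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: source-snapshot
  namespace: <APP_NAMESPACE>
spec:
  volumeSnapshotClassName: <TRIDENT_SNAPSHOT_CLASS>
  source:
    persistentVolumeClaimName: <IMPORTED_SOURCE_PVC>
EOF

# Wait for the snapshot to become ready before requesting the clone:
kubectl wait volumesnapshot/source-snapshot -n <APP_NAMESPACE> --for=jsonpath='{.status.readyToUse}'=true --timeout=10m

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
  namespace: <APP_NAMESPACE>
spec:
  storageClassName: <TRIDENT_STORAGE_CLASS>
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: <SOURCE_VOLUME_SIZE>
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: source-snapshot
EOF

# The cloned PVC then stays Pending indefinitely while trident-main logs the
# errors shown above:
kubectl get pvc cloned-pvc -n <APP_NAMESPACE> -w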

Expected behaviour

The clone should be created and bound to the PVC as normal.

Additional context

This problem does not occur every time, but when cloning fails, it usually fails in this way. It was originally occurring on 23.01.0, and is still happening on the latest 23.10.0 after upgrading.