kubernetes-sigs/aws-fsx-csi-driver

Allow configurable timeout for WaitForFileSystemAvailable

jon-rei opened this issue · 21 comments

Is your feature request related to a problem? Please describe.

We are starting to use Scratch 2 FSx FileSystems and are doing so successfully for small ones. But when we try to create FSx volumes >50TB we always run into timeouts:

E0701 17:57:05.334673       1 driver.go:104] "GRPC error" err="rpc error: code = Aborted desc = Create volume request for pvc-1234 is already in progress"
E0701 17:57:06.262188       1 driver.go:104] "GRPC error" err=<
	rpc error: code = Internal desc = Filesystem is not ready: RequestCanceled: request context canceled
	caused by: context canceled
 >

It seems that the timeout is always 5 minutes, but creating a 72TB volume for example takes ~12 minutes.

Describe the solution you'd like in detail

It would be great if this timeout could be increased or made configurable since it's hardcoded here.
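To make the idea concrete, here is a minimal sketch of what a configurable wait could look like, assuming the wait is a wait.Poll loop like the one in cloud.go; the flag name (--wait-for-fs-timeout) and the describeLifecycle helper are hypothetical and not part of the driver:

package main

import (
    "flag"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical flag replacing the hard-coded timeout.
var fsAvailableTimeout = flag.Duration("wait-for-fs-timeout", 10*time.Minute,
    "how long to wait for the FSx filesystem to become AVAILABLE")

// waitForFileSystemAvailable polls the (stubbed) lifecycle check until it
// reports AVAILABLE or the configured timeout expires. wait.ErrWaitTimeout is
// what surfaces as "timed out waiting for the condition" in the driver logs.
func waitForFileSystemAvailable(describeLifecycle func() (string, error)) error {
    return wait.Poll(30*time.Second, *fsAvailableTimeout, func() (bool, error) {
        state, err := describeLifecycle()
        if err != nil {
            return false, err
        }
        fmt.Println("WaitForFileSystemAvailable status =", state)
        return state == "AVAILABLE", nil
    })
}

func main() {
    flag.Parse()
    // Stub that becomes AVAILABLE immediately; in the driver this would be a
    // DescribeFileSystems call against the real filesystem.
    fmt.Println(waitForFileSystemAvailable(func() (string, error) { return "AVAILABLE", nil }))
}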

Describe alternatives you've considered

Right now we are trying to create the filesystems manually and then create a PV + PVC, but we would really like to create the volumes dynamically for the pipelines we want to run.

I wrote a small script to check the creation time of the filesystems and found that as they get bigger they take longer to create. I did my tests in eu-west-1 and I'm not sure if this is something specific to that region.
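For reference, a rough sketch of that kind of measurement loop (not the actual script), assuming aws-sdk-go v1, that it is started right after the CreateFileSystem call, and that the filesystem ID is its only argument:

package main

import (
    "fmt"
    "os"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
    fsID := os.Args[1]
    svc := fsx.New(session.Must(session.NewSession()))
    start := time.Now()

    for {
        out, err := svc.DescribeFileSystems(&fsx.DescribeFileSystemsInput{
            FileSystemIds: []*string{aws.String(fsID)},
        })
        if err != nil || len(out.FileSystems) == 0 {
            fmt.Fprintln(os.Stderr, "describe failed:", err)
            os.Exit(1)
        }
        lifecycle := aws.StringValue(out.FileSystems[0].Lifecycle)
        elapsed := time.Since(start).Round(time.Second)
        if lifecycle == "AVAILABLE" {
            fmt.Printf("filesystem %s was available after %s\n", fsID, elapsed)
            return
        }
        fmt.Printf("filesystem %s is still %s after %s\n", fsID, lifecycle, elapsed)
        time.Sleep(30 * time.Second)
    }
}

The creation times we measured: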

FSx for Lustre file system with 4800 GiB was available after 6 minutes 44 seconds
FSx for Lustre file system with 9600 GiB was available after 9 minutes 6 seconds
FSx for Lustre file system with 24000 GiB was available after 8 minutes 32 seconds
FSx for Lustre file system with 50400 GiB was available after 9 minutes 5 seconds
FSx for Lustre file system with 100800 GiB was available after 11 minutes 19 seconds

So I think it would be great if the timeout could either grow dynamically with storage capacity or be available for manual configuration.
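As an illustration of the "grow with capacity" idea, here is a heuristic loosely derived from the numbers above; all constants are assumptions, and this is not anything the driver does today:

package main

import (
    "fmt"
    "time"
)

// timeoutForCapacity returns a base timeout plus extra time per 10 TiB of
// requested capacity, clamped to a maximum.
func timeoutForCapacity(capacityGiB int64) time.Duration {
    const (
        base       = 10 * time.Minute
        perTenTiB  = 1 * time.Minute
        maxTimeout = 60 * time.Minute
    )
    timeout := base + time.Duration(capacityGiB/10240)*perTenTiB
    if timeout > maxTimeout {
        return maxTimeout
    }
    return timeout
}

func main() {
    for _, gib := range []int64{4800, 50400, 100800, 153600} {
        fmt.Printf("%6d GiB -> %s\n", gib, timeoutForCapacity(gib))
    }
}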

@jonded94 and I dug a little deeper to figure out where the RequestCanceled (above) was coming from and found that it's actually happening on the DescribeFileSystemsWithContext call, probably around this line.
From our logs, which we enhanced a bit, we saw that the request that ends in a RequestCanceled takes just under 1 millisecond, while successful requests take ~200ms. In one of our tests the query succeeds about 9 times, and the 10th time it fails.

Faulty request:

{"ts":1720456920534.6946,"caller":"cloud/cloud.go:392","msg":"DescribeFileSystemsWithContext","v":2,"filesystem":"fs-1234"}
{"ts":1720456920534.7668,"caller":"cloud/cloud.go:394","msg":"DescribeFileSystemsWithContextDone","v":2,"filesystem":"fs-1234","output":"{\n\n}","err":"RequestCanceled: request context canceled\ncaused by: context canceled"}

Working request:

{"ts":1720456860534.6147,"caller":"cloud/cloud.go:392","msg":"DescribeFileSystemsWithContext","v":2,"filesystem":"fs-1234"}
{"ts":1720456860741.3599,"caller":"cloud/cloud.go:394","msg":"DescribeFileSystemsWithContextDone","v":2,"filesystem":"fs-1234","output":"{\n  FileSystems: [{\n      CreationTime: 2024-07-08 16:37:00.357 +0000 UTC,\n      DNSName: \"fs-051ce4079e5c77f58.fsx.eu-west-1.amazonaws.com\",\n      FileSystemId: \"fs-1234\",\n      FileSystemType: \"LUSTRE\",\n      FileSystemTypeVersion: \"2.15\",\n      Lifecycle: \"CREATING\",\n      LustreConfiguration: {\n        CopyTagsToBackups: false,\n        DataCompressionType: \"NONE\",\n        DeploymentType: \"SCRATCH_2\",\n        LogConfiguration: {\n          Level: \"DISABLED\"\n        },\n        MountName: \"wcsrlbev\",\n        WeeklyMaintenanceStartTime: \"2:02:00\"\n      },\n      NetworkInterfaceIds: [\n        \"...a lot of enis ~60..."\n      ],\n      OwnerId: \"1234\",\n      ResourceARN: \"arn:aws:fsx:eu-west-1:1234:file-system/fs-1234\",\n      StorageCapacity: 124800,\n      StorageType: \"SSD\",\n      SubnetIds: [\"subnet-1234\"],\n      Tags: [{\n          Key: \"CSIVolumeName\",\n          Value: \"pvc-1234\"\n        },{\n          Key: \"team\",\n          Value: \"abc\"\n        }],\n      VpcId: \"vpc-1234\"\n    }]\n}","err":null}

We also found that we only get these errors for filesystems larger than 100TB. We have successfully created several <100TB FSx volumes. It seems that the large volumes always fail after 5 minutes of creation and the smaller ones do not run into this RequestCanceled.
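For anyone following along, a tiny standard-library sketch of why an already-canceled context fails in well under a millisecond: the call aborts before any network round trip. No AWS SDK involved; fakeDescribe just stands in for a context-aware call like DescribeFileSystemsWithContext.

package main

import (
    "context"
    "fmt"
    "time"
)

// fakeDescribe honors ctx the same way a context-aware SDK call would.
func fakeDescribe(ctx context.Context) error {
    select {
    case <-ctx.Done():
        return ctx.Err() // returns immediately if ctx is already canceled
    case <-time.After(200 * time.Millisecond): // simulated API latency
        return nil
    }
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())

    start := time.Now()
    fmt.Println("with live context:", fakeDescribe(ctx), "in", time.Since(start))

    cancel() // e.g. the csi-provisioner gave up on the gRPC call

    start = time.Now()
    fmt.Println("after cancel:     ", fakeDescribe(ctx), "in", time.Since(start))
}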

Edit:
Might be related to: https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L822

Looking at the readme: https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file
At first the timeout seemed related to

--timeout <duration>: Timeout of all calls to CSI driver. It should be set to value that accommodates majority of ControllerCreateVolume and ControllerDeleteVolume calls. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file#csi-error-and-timeout-handling) for details. 15 seconds is used by default.

which wouldn't actually be related (it defaults to 15 seconds, not 5 minutes)

It seems more likely that the describe call is catching the context cancellation rather than being the cause itself. The context is probably being canceled on the CSI side, during either the csiClient.CreateVolume or Provision calls.

Potentially this? The 5 minutes would line up with your observations:

--retry-interval-max <duration>: Maximum retry interval of failed provisioning or deletion. Default value is 5 minutes. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file#csi-error-and-timeout-handling) for details.

We wanted to do a deeper dive into what the problem is and to test exactly when the filesystem creation fails and when it succeeds.
For this it's good to know what our account looks like: we currently have a quota of 300TB for Lustre FSx and have manually created a 110TB filesystem that is always present during our tests.

Unsuccessful test

For our first test we created a 150TB volume:

  1. Creating a new PVC with a claim for the 150TB filesystem
  2. The csi-driver logs that it saw the creation of the PVC and starts provisioning
  3. The fsx-plugin creates the filesystem and we can see in the AWS console that it is in "Creating" state
  4. After 5 minutes the csi-provisioner gets a timeout (the timeout mentioned in the comment above) and logs Retrying syncing claim
  5. The fsx-plugin now logs CreateVolume: called again and calls CreateFileSystem, which seems to check our quota usage; since we are already using 150+110TB of our 300TB quota, it fails because "another" 150TB filesystem would push us over the quota (150 + 110 + 150 = 410TB > 300TB)

Successful test

Next we tested with a smaller volume and created a 90TB filesystem:
(the first 4 steps are exactly the same)
5. The fsx-plugin now logs CreateVolume: called again, but instead of failing it just logs Create volume request for pvc-1234 is already in progress. We think the quota check succeeds here, since 90+110TB out of 300TB still leaves room for "another" 90TB filesystem (90 + 110 + 90 = 290TB ≤ 300TB).

From what we observed in our tests, after the csi-provisioner times out, the CreateVolume function is called again and, instead of checking whether a filesystem is already being created, it first checks whether the quota would be exceeded by a "new" filesystem.

Workaround

In a further test, we tried increasing all configurable timeouts, mainly the csi-provisioner's timeout and retry-interval-max, to 15 minutes, using a local build of our PR #383.
This allowed us to create the 150TB filesystem that previously failed in about 13 minutes. It was only successful because none of the components involved had timeouts that would have triggered another quota check. We also had a test run where the currently active csi driver changed its leader, and this led to another crash loop of the fsx-plugin.
In the end, increasing these timeouts is probably not necessary, as there is logic for detecting volumes already in progress, but it will cause problems if the quotas are set too tight.

A temporary fix for us now would be to increase the quota for Lustre FSx, and in the long term it would be helpful to change the logic of how an ongoing volume request is detected so that it happens before a new volume is created.

Note: it seems like the quota check is being triggered by a new create API call. This needs further investigation to confirm.
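To make the "detect before create" suggestion a bit more concrete, here is a rough sketch (using aws-sdk-go v1; the function name and tag-lookup logic are illustrative, not the driver's actual code) of looking up an existing filesystem by its CSIVolumeName tag before ever calling CreateFileSystem:

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/fsx"
)

// findFileSystemByVolumeName pages through DescribeFileSystems and returns the
// filesystem whose CSIVolumeName tag matches volName, or nil if none exists.
func findFileSystemByVolumeName(ctx context.Context, svc *fsx.FSx, volName string) (*fsx.FileSystem, error) {
    var found *fsx.FileSystem
    err := svc.DescribeFileSystemsPagesWithContext(ctx, &fsx.DescribeFileSystemsInput{},
        func(page *fsx.DescribeFileSystemsOutput, lastPage bool) bool {
            for _, fs := range page.FileSystems {
                for _, tag := range fs.Tags {
                    if aws.StringValue(tag.Key) == "CSIVolumeName" && aws.StringValue(tag.Value) == volName {
                        found = fs
                        return false // stop paging
                    }
                }
            }
            return true
        })
    return found, err
}

func main() {
    svc := fsx.New(session.Must(session.NewSession()))
    fs, err := findFileSystemByVolumeName(context.Background(), svc, "pvc-1234")
    if err != nil {
        panic(err)
    }
    if fs != nil {
        // Keep waiting on the existing filesystem instead of calling
        // CreateFileSystem again (which would re-trigger the quota check).
        fmt.Println("found existing filesystem:", aws.StringValue(fs.FileSystemId),
            "lifecycle:", aws.StringValue(fs.Lifecycle))
        return
    }
    fmt.Println("no existing filesystem; safe to call CreateFileSystem")
}

With a check like this, a retried CreateVolume would find the filesystem that is still CREATING and keep waiting for it, instead of tripping the quota check with a second CreateFileSystem call.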

Hello, we are trying to create a StorageClass with the following configuration:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-123
  securityGroupIds: sg-123
  autoImportPolicy: NONE
  s3ImportPath: s3://s3-bucket
  deploymentType: SCRATCH_2
  fileSystemTypeVersion: "2.15"
mountOptions:
  - flock

And it fails. If I try to create a FS with the same parameters via the UI, it gets created in no less than 6 minutes, so the 5 minute timeout is a bit problematic. Any idea when this will be solved? Thanks 🙏

@JoseAlvarezSonos have you tried creating it using the dynamic provisioning mechanism? The 5 minute timeout shouldn't prevent provisioning a new filesystem unless an additional call would exceed the quota:

the quota would be exceeded with "another"

@jacobwolfaws we are using the Dynamic Provisioning with Data Repository example from this repo. That's what you mean by "dynamic provisioning mechanism" right?

@jacobwolfaws we are using the Dynamic Provisioning with Data Repository example from this repo. That's what you mean by "dynamic provisioning mechanism" right?

yes

@jacobwolfaws in that case yes, but we see the exact same problem:

fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin I0715 14:26:14.476895       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-123" status="CREATING"
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin I0715 14:26:44.447783       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-123" status="CREATING"
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner I0715 14:27:14.122762       1 controller.go:1082] Temporary error received, adding PVC 123456asdf to claims in progress
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner W0715 14:27:14.122810       1 controller.go:934] Retrying syncing claim "123456asdf", failure 5
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner E0715 14:27:14.122837       1 controller.go:957] error syncing claim "123456asdf": failed to provision volume with StorageClass "fsx-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner I0715 14:27:14.122866       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"123456asdf", APIVersion:"v1", ResourceVersion:"240051348", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "fsx-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin E0715 14:27:14.292413       1 driver.go:104] "GRPC error" err=<
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin 	rpc error: code = Internal desc = Filesystem is not ready: RequestCanceled: request context canceled
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin 	caused by: context canceled
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin  >

BTW I obfuscated some values.

I recommend configuring the csi-provisioner with a retry-interval-max of 15 minutes. I'm not sure if @jon-rei tested it with the driver as-is, but it's worth a shot to see if it resolves your issue.

@jacobwolfaws I read that, but the Helm chart doesn't provide a way to configure that. The args, where I assume I would need to add this, are hardcoded and limited to this; any recommendation on that front?

In the short term, you could use kustomize or maintain your own version of the Helm chart. This is a gap we'll close before the next CSI driver release.

@jacobwolfaws several things were needed in @jon-rei's and my tests to make dynamic volume creation of sizes >80TiB possible with this driver.

  • One has to raise the timeouts of the csi-provisioner; otherwise the volume creation will be triggered again, which can crash if the Lustre storage quota would be exceeded
  • On top of that, one also has to use our PR #383, which increases the csi-driver's internal timeout for the WaitForFileSystemAvailable call; otherwise that call errors out

Have you been testing our PR, or were you able to debug the double creation call?

@jacobwolfaws I tried your suggestion but it didn't work. To try "an extreme" and see if that was the issue, I increased the timeout and retry-interval-max to 60m, but it seems to fail at around 10m, which is somewhat consistent with the original problem statement of this issue. Any other suggestions?

          args:
            - --csi-address=$(ADDRESS)
            - --v={{ .Values.sidecars.provisioner.logLevel }}
            - --timeout=60m
            - --retry-interval-max=60m
            - --extra-create-metadata
            - --leader-election=true

Logs:

fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:10:43.104217       1 controller.go:1366] provision "dev/fsx-claim" class "fsx-sc": started
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:10:43.105202       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"1234asdf", APIVersion:"v1", ResourceVersion:"240151670", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dev/fsx-claim"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:11:14.844059       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:11:44.761550       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:12:14.762342       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:12:44.797206       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:13:14.756297       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:13:44.761706       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:14:14.833685       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:14:44.756280       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:15:14.771227       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:15:44.753475       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:16:14.763102       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:16:44.780487       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:17:14.751876       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:17:44.779109       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:18:14.754597       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:18:44.750220       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:19:14.745984       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:19:44.755963       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:14.751274       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:44.722817       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:44.932801       1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin E0715 16:20:44.932894       1 driver.go:104] "GRPC error" err="rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition"
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:20:44.933487       1 controller.go:1075] Final error received, removing PVC 1234asdf from claims in progress
fsx-csi-controller-56b4dbc96-679rb csi-provisioner W0715 16:20:44.933529       1 controller.go:934] Retrying syncing claim "1234asdf", failure 0
fsx-csi-controller-56b4dbc96-679rb csi-provisioner E0715 16:20:44.933575       1 controller.go:957] error syncing claim "1234asdf": failed to provision volume with StorageClass "fsx-sc": rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:20:44.933619       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"1234asdf", APIVersion:"v1", ResourceVersion:"240151670", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "fsx-sc": rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition

Not an expert here, but the 10 min timeout in WaitForFileSystemAvailable sounds very much like the issue that could be solved by our forked version: #383
Can you try building that and testing whether it works?

Can you also collect the logs from the provisioner and fsx-plugin containers?

kubectl logs fsx-csi-controller-${CONTROLLER_SUFFIX} -n $NAMESPACE fsx-plugin --v=5 >> fsx-csi-controller-pod-plugin-${CONTROLLER_SUFFIX}.logs
kubectl logs fsx-csi-controller-${CONTROLLER_SUFFIX} -n $NAMESPACE csi-provisioner --v=5 >> fsx-csi-controller-pod-provisioner-${CONTROLLER_SUFFIX}.logs

It would also help to increase the log level to 5 for those containers in the values.yaml file of your chart.

@jonded94 I'll try and let you know, thanks

@jacobwolfaws I enabled more logging, but I get tons of leaderelection messages, which are not that useful or interesting; the only other notable messages are these:

fsx-csi-controller-5656c55766-thkw6 csi-provisioner I0716 09:12:45.091618       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"9e02b8ae-8ad8-4605-acbd-aac4a9708996", APIVersion:"v1", ResourceVersion:"240869156", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dev/fsx-claim"
fsx-csi-controller-5656c55766-thkw6 csi-provisioner I0716 09:12:45.091524       1 connection.go:245] GRPC request: {"capacity_range":{"required_bytes":1288490188800},"name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","parameters":{"autoImportPolicy":"NONE","csi.storage.k8s.io/pv/name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","csi.storage.k8s.io/pvc/name":"fsx-claim","csi.storage.k8s.io/pvc/namespace":"dev","deploymentType":"SCRATCH_2","extraTags":"service=TAG1,project=TAG2","fileSystemTypeVersion":"2.15","s3ExportPath":"s3://s3-bucket/export","s3ImportPath":"s3://s3-bucket","securityGroupIds":"sg-1234","subnetId":"subnet-1234"},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["flock"]}},"access_mode":{"mode":5}}]}
fsx-csi-controller-5656c55766-thkw6 fsx-plugin I0716 09:12:45.092512       1 controller.go:109] "CreateVolume: called" args={"name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","capacity_range":{"required_bytes":1288490188800},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["flock"]}},"access_mode":{"mode":5}}],"parameters":{"autoImportPolicy":"NONE","csi.storage.k8s.io/pv/name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","csi.storage.k8s.io/pvc/name":"fsx-claim","csi.storage.k8s.io/pvc/namespace":"dev","deploymentType":"SCRATCH_2","extraTags":"service=TAG1,project=TAG2","fileSystemTypeVersion":"2.15","s3ExportPath":"s3://s3-bucket/export","s3ImportPath":"s3://s3-bucket","securityGroupIds":"sg-1234","subnetId":"subnet-1234"}}

It's worth mentioning that I tried adding an exportPath to see if it helps, but it didn't. I also tried with a smaller S3 bucket (less than 5MB total size), and it worked in around 5-10 minutes, but the "important" bucket is around 200TB and in production it will be around twice that size. Nonetheless, if I create the FS manually it works, so maybe there's a parameter missing to avoid copying the whole content of the bucket, or populating the metadata, or whatever it is that's taking that long.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale
@JoseAlvarezSonos did increasing the timeout with a custom image help with your issue as @jonded94 suggested?

@jonded94

Have you been testing our PR or could debug the double creation call?

It seems that while the CreateVolume call is idempotent, the CreateFilesystem call is not (after the first CreateVolume reaches its timeout). We recently had an issue with orphaned filesystems in the k8s testing account:

#389
#390

and I'm inclined to believe the issues are related. Fixing the CreateFilesystem issue / having CreateVolume check for a filesystem that is already being created should also fix the problems you've been seeing.