kubernetes-sigs/aws-fsx-csi-driver

Add initContainers to the helm chart for the node DaemonSet

jon-rei opened this issue · 1 comment

Is your feature request related to a problem? Please describe.

We sometimes get "no space left on device" errors when running high-throughput jobs on our FSx for Lustre filesystem. This happens most often when the filesystem is also low on space.
There is an AWS documentation page about this error, see here. It suggests fixing this by running the following on the host: sudo lctl set_param osc.*.max_dirty_mb=64

Describe the solution you'd like in detail

Our idea is to fix this by running an init container, similar to this example, on the node DaemonSet. That way the setting is guaranteed to be applied on the node before our actual workloads start.
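
To make the proposal concrete, here is a rough sketch of what such an init container could look like in the node DaemonSet's pod spec. This is not the chart's actual template. The container name and image are hypothetical, the image is assumed to ship the lctl binary, and privileged mode is assumed to be needed to change Lustre tunables on the host kernel. It also assumes the parameter can be set at that point in the node's lifecycle.

```yaml
# Sketch only: a fragment of a node DaemonSet pod spec, not the chart's real template.
spec:
  template:
    spec:
      initContainers:
        - name: set-lustre-params                   # hypothetical name
          image: example.com/lustre-client:latest   # hypothetical image, assumed to contain lctl
          securityContext:
            privileged: true                        # assumption: required to write host-level Lustre parameters
          command:
            - /bin/sh
            - "-c"
            # Same command as suggested by the AWS documentation, quoted so the shell does not expand the glob.
            - "lctl set_param 'osc.*.max_dirty_mb=64'"
```

Since the DaemonSet runs one pod per node, the command would run once per node rather than once per workload pod.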

Describe alternatives you've considered

Running an initContainer on our workload pods. However, we run hundreds of pods, some of which share a node, so this is not practical.

Would this be a reasonable approach to fix this problem? If so, I would raise a PR to be able to add an initContainer to the node DaemonSet.

Hi @jon-rei, sorry for taking so long to respond. I think this approach makes sense, since Lustre functionality should be maintained in the minimal base image, and it saves having to run this init container in every workload pod.