Splunk Operator: slow mounting of EBS volume keeps pod in "ContainerCreating" state for too long

Question

yaroslav-nakonechnikov opened this issue 8 months ago · 16 comments

yaroslav-nakonechnikov commented 8 months ago

Please select the type of request

Enhancement

Tell us more

Describe the request
We use quite large EBS volumes (10 TB+) for indexers, and sometimes a pod has to move to a different node. When that happens, mounting the EBS volume and starting the pod takes too much time.
In our case it takes 70 minutes just to start the pod after it has been assigned to a node.

After investigation, we found that by default k8s recursively changes ownership and permissions of the volume contents on every mount (ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods),
and on a volume of this size that takes a lot of time.

Expected behavior
The documentation describes, with some examples, how to solve this. We expect the CRD to support fsGroupChangePolicy with a default value of "OnRootMismatch".
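
For readers unfamiliar with the setting, fsGroupChangePolicy is a pod-level securityContext field in Kubernetes; its default is "Always", and "OnRootMismatch" skips the recursive ownership change when the volume root already has the expected owner and permissions. The manifest below is only a minimal sketch of where the field sits. The pod name, image, user/group IDs, and claim name are hypothetical placeholders and are not taken from the Splunk Operator's manifests or CRD.

```yaml
# Sketch only: pod name, image, IDs, and claim name are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-policy-demo
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 2000
    # Skip the recursive chown/chmod when the volume root already
    # has the expected owner and permissions (the default is "Always").
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-data-pvc   # assumed to exist already
```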

Answer 1 · 2024-02-21T07:27:41.000Z

@yaroslav-nakonechnikov Just wanted to check: which version of the Splunk Operator are you using?

Answer 2 · 2024-02-21T07:54:51.000Z

@vivekr-splunk The CRD hasn't changed much since the beginning, but I'd say 2.4 and 2.5 don't have that feature.

Answer 3 · 2024-02-21T08:31:18.000Z

@vivekr-splunk @akondur A Splunk support ticket has also been raised for this matter. Please refer to the following case number: "CASE [3423864]".

Answer 4 · 2024-02-21T17:59:03.000Z

Hi @yaroslav-nakonechnikov , is the request here to change the fsGroupChangePolicy to OnRootMismatch?

Answer 5 · 2024-02-22T07:40:37.000Z

The request is to add support for it and to inform users about potential issues with big volumes.

As a result, it could become the default, since from my perspective it doesn't look necessary to change permissions on each mount.

Answer 6 · 2024-02-23T01:04:59.000Z

@yaroslav-nakonechnikov , have you tried changing the fsGroupChangePolicy to OnRootMismatch and checked whether that fixes the issue in your environment? This can be done by manually disabling the operator (temporarily) and testing it on one of your Splunk instances. We are currently evaluating the option on our end.

Answer 7 · 2024-02-23T07:29:27.000Z

@akondur How? Any change to the StatefulSet/pod causes it to be recreated, and the CRD doesn't have that option.

Answer 8 · 2024-02-23T19:22:20.000Z

@yaroslav-nakonechnikov You could create a simple Splunk StatefulSet that attaches to EBS volumes and try reproducing the issue, after which you can change the policy to see whether the behavior changes. Alternatively, before moving the pods to other nodes, you could delete the operator temporarily and edit the StatefulSet.
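
A standalone reproduction along these lines might look roughly like the sketch below. It is not the operator's own StatefulSet: the names, image, fsGroup value, storage class, and volume size are all assumptions chosen for illustration. Since the ownership change is recursive, the volume would presumably need to hold a large number of files before the difference between the two policies becomes visible.

```yaml
# Sketch only: every name and value here is a hypothetical placeholder.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ebs-mount-test
spec:
  serviceName: ebs-mount-test
  replicas: 1
  selector:
    matchLabels:
      app: ebs-mount-test
  template:
    metadata:
      labels:
        app: ebs-mount-test
    spec:
      securityContext:
        fsGroup: 2000
        # First run: "Always" (the Kubernetes default).
        # Second run: change to "OnRootMismatch" and compare pod start times.
        fsGroupChangePolicy: "Always"
      containers:
        - name: app
          image: busybox:1.36
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3   # assumed name of an EBS-backed storage class
        resources:
          requests:
            storage: 100Gi
```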

Answer 9 · 2024-02-26T07:54:19.000Z

@akondur In that case, why can't you recheck it yourselves, if you already know what to recheck and how?

I reported the problem as a customer. Now it is your turn to take it from here and reproduce it.
Honestly, I don't understand why I have to spin up another cluster with another 11 TB of disks and fill it all with dummy data. Will you pay for it?

Answer 10 · 2024-02-26T18:11:41.000Z

Hello @yaroslav-nakonechnikov, thank you for investigating this issue and identifying a possible solution. We will replicate the problem on our end and test whether your fix resolves it. We will get back to you on this soon.

Answer 11 · 2024-02-28T16:58:04.000Z

Hey @yaroslav-nakonechnikov , we have merged the change to update the fsGroupChangePolicy. Please let us know if the issue still persists and we can re-visit the issue.

Answer 12 · 2024-02-29T09:55:53.000Z

@akondur This is good.
So now we need to wait until it is released.

For now I don't know how to check it, given that 2.5.0 and 2.5.1 are also not working as expected.

Answer 13 · 2024-02-29T17:02:38.000Z

@yaroslav-nakonechnikov We have reverted the change as we are going to release 2.5.2 this week. Will re-introduce it right after in develop. If this change is needed soon - we will make another minor release. Will update the PR here as soon as it's ready.

Answer 14 · 2024-03-06T21:34:46.000Z

Hey @yaroslav-nakonechnikov , please find the merged MR into develop here. Please let me know if you're still facing issues with this change.

Answer 15 · 2024-04-16T18:33:13.000Z

Closing this issue per the MR. Please re-open it if the issue still persists.

Answer 16 · 2024-04-17T06:47:37.000Z

How can it be closed if it is not released yet?