splunk/splunk-operator

Splunk Operator: indexers don't start with 9.1.2

yaroslav-nakonechnikov opened this issue · 15 comments

Please select the type of request

Bug

Tell us more

Describe the request
All nodes start as expected, but the indexers don't.

Expected behavior
Everything works as it did before, on 9.1.1.

Splunk setup on K8S
EKS

Reproduction/Testing steps

  • Start a cluster with 9.1.1,
    then replace the Splunk image with 9.1.2
  • Or simply try to start a cluster with 9.1.2
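For the first step, the image swap is done by changing the `image` field on the operator custom resources and letting the operator roll the pods. A minimal sketch of such a CR (resource name, namespace, and API version are assumptions; adjust to your deployment):

```yaml
# Hypothetical ClusterManager CR; the same spec.image change applies
# to the IndexerCluster and other Splunk Operator CRs.
apiVersion: enterprise.splunk.com/v4
kind: ClusterManager
metadata:
  name: cm
  namespace: splunk
spec:
  image: splunk/splunk:9.1.2   # previously splunk/splunk:9.1.1
```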

K8s environment
1.28

Additional context(optional)

So, after several tests I can pinpoint what breaks.

In the cluster manager definition we had this:

 smartstore:
   defaults:
     maxGlobalDataSizeMB: 0
     maxGlobalRawDataSizeMB: 0
     volumeName: smartstore
   indexes:
   - hotlistBloomFilterRecencyHours: 1
     hotlistRecencySecs: 3600
     name: tf-test
     remotePath: tf-test/
     volumeName: smartstore
   volumes:
   - endpoint: https://s3-eu-central-1.amazonaws.com
     name: smartstore
     path: bucket-for-smart-store
     provider: aws
     region: eu-central-1
     storageType: s3

When we removed that block and recreated the CM and the indexers, everything started working.

The behavior is the same with splunk-operator 2.4.0 and with the latest version.

Final tests show that the problem is in the defaults section.
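As a workaround sketch, this is the same SmartStore spec with only the problematic defaults block dropped (index and volume values copied from the definition above):

```yaml
# Same spec as before, minus the defaults section that breaks
# indexer startup on 9.1.2.
smartstore:
  indexes:
  - hotlistBloomFilterRecencyHours: 1
    hotlistRecencySecs: 3600
    name: tf-test
    remotePath: tf-test/
    volumeName: smartstore
  volumes:
  - endpoint: https://s3-eu-central-1.amazonaws.com
    name: smartstore
    path: bucket-for-smart-store
    provider: aws
    region: eu-central-1
    storageType: s3
```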

Further investigation shows that splunk-operator creates these default settings:

[splunk@splunk-site1-indexer-0 splunk]$ bin/splunk btool indexes  list --debug | grep "\[default\]"
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf                 [default]
[splunk@splunk-site1-indexer-0 splunk]$ cat /opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf
[default]
repFactor = auto
maxDataSize = auto
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb

[volume:smartstore]
storageType = remote
path = s3://bucket-for-smart-store
remote.s3.endpoint = https://s3-eu-central-1.amazonaws.com
remote.s3.auth_region = eu-central-1

and Splunk fails to start with the definition generated from the CRD.

We also had some default settings defined in a custom app of ours, and those break indexer startup too. So something changed that shouldn't have been touched.

Hello @yaroslav-nakonechnikov, are you using IRSA with PrivateLink?

@vivekr-splunk, no, we don't use PrivateLink.

The main point is that the same config worked fine with 9.1.1.

Hello @yaroslav-nakonechnikov, this has been fixed in the upcoming 9.1.3 and 9.0.7 releases, and also in 9.2.1.

Still the same issue with 9.1.3:

FAILED - RETRYING: Restart the splunkd service - Via CLI (5 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (4 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (3 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (2 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (1 retries left).

RUNNING HANDLER [splunk_common : Restart the splunkd service - Via CLI] ********
fatal: [localhost]: FAILED! => {
    "attempts": 60,
    "changed": true,
    "cmd": [
        "/opt/splunk/bin/splunk",
        "restart",
        "--answer-yes",
        "--accept-license"
    ],
    "delta": "0:00:11.173687",
    "end": "2024-01-25 15:06:11.736729",
    "rc": 10,
    "start": "2024-01-25 15:06:00.563042"
}

STDOUT:

splunkd is not running.

Splunk> 4TW

Checking prerequisites...
        Checking mgmt port [8089]: open
        Checking kvstore port [8191]: open
        Checking configuration... Done.


STDERR:

ERROR: pid 5825 terminated with signal 11 (core dumped)
Validating databases (splunkd validatedb) failed with code '-1'.  If you cannot resolve the issue(s) above after consulting documentation, please file a case online at http://www.splunk.com/page/submit_issue


MSG:

non-zero return code
Thursday 25 January 2024  15:06:11 +0000 (0:22:44.302)       0:23:46.336 ******
Thursday 25 January 2024  15:06:11 +0000 (0:00:00.000)       0:23:46.336 ******
Thursday 25 January 2024  15:06:11 +0000 (0:00:00.000)       0:23:46.337 ******

PLAY RECAP *********************************************************************
localhost                  : ok=106  changed=20   unreachable=0    failed=1    skipped=67   rescued=0    ignored=0

Thursday 25 January 2024  15:06:11 +0000 (0:00:00.003)       0:23:46.341 ******
===============================================================================
splunk_common : Restart the splunkd service - Via CLI ---------------- 1364.30s
splunk_common : Restart the splunkd service - Via CLI ------------------ 18.39s
splunk_common : Set options in saml ------------------------------------- 6.26s
splunk_common : Set options in roleMap_SAML ----------------------------- 6.04s
splunk_common : Get Splunk status --------------------------------------- 1.43s
splunk_common : Set node as license slave ------------------------------- 1.17s
splunk_indexer : Update HEC token configuration ------------------------- 1.17s
Gathering Facts --------------------------------------------------------- 1.14s
splunk_indexer : Set current node as indexer cluster peer --------------- 1.12s
splunk_common : Update /opt/splunk/etc ---------------------------------- 0.97s
splunk_indexer : Setup Peers with Associated Site ----------------------- 0.97s
splunk_common : Set options in authentication --------------------------- 0.88s
splunk_common : Test basic https endpoint ------------------------------- 0.79s
splunk_indexer : Setup global HEC --------------------------------------- 0.70s
splunk_indexer : Check for required restarts ---------------------------- 0.68s
Check for required restarts --------------------------------------------- 0.67s
splunk_indexer : Get existing HEC token --------------------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.66s
splunk_common : Check Splunk instance is running ------------------------ 0.66s
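The stderr above points at `splunkd validatedb` crashing with signal 11 during the restart handler. To reproduce the failure in isolation inside the pod, the validation step can be run by hand (a sketch; the pod name is taken from the btool output earlier in this thread and may differ in your setup):

```shell
# Exec into the failing indexer pod (name is an example).
kubectl exec -it splunk-site1-indexer-0 -- /bin/bash

# Run the database validation step that the restart handler
# reported as terminating with signal 11.
/opt/splunk/bin/splunkd validatedb

# Check which indexes.conf stanzas splunkd actually resolves.
/opt/splunk/bin/splunk btool indexes list --debug
```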

That one looks fixed in 9.2.*,
but I'm still testing.

Still hitting the same error on 9.2.0 and Splunk Operator 2.5.0.

@fabiusgoh, have you raised a ticket with Splunk Support? May I ask for its number?

I have not raised a support ticket yet; I'm in the midst of testing it on 9.1.3, as that is the officially supported version for the operator.

I can confirm that 9.2 and 9.2.0.1 start with our config,
which wasn't working with 9.1.2 and 9.1.3.

@yaroslav-nakonechnikov, as we discussed in our meeting, we now understand the issue. This problem arose from the upgrade path we followed in the 2.5.0 release. Previously, we expected the search head clusters to be running before starting the indexers (when both the indexers and the SHC point to the same CM). However, since the SHC had trouble starting, the indexers were never created.
As agreed, we will modify the logic to start the indexers in parallel with the search heads. We'll keep you updated on our progress with these changes.

@vivekr-splunk yep, I agree, it was an informative meeting. But this ticket is different: it is about Splunk logic itself (or splunk-ansible), which was fixed in the Splunk container starting from 9.2.0.

We were discussing #1293.

Also, today I rechecked 9.1.4: it is not working either.
So 9.1.1 is both the last working version and the last supported one.

All others are either broken or unsupported.