splunk/splunk-operator

Splunk Operator: something is breaking local config files on pod restart

Closed this issue · 13 comments

Please select the type of request

Bug

Tell us more

Describe the request
From time to time we see strange behavior: config files that were pushed through default.yml are broken after a pod restart.

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat /opt/splunk/etc/system/local/authentication.conf
 
[authentication]
authSettings = saml
authType = SAML
authSettings
authType
 
[saml]
entityId = splunkACSEntityId
fqdn = https://cm.fqdn.cloud
idpSSOUrl = https://idp.fqdn.com/idp/SSO.saml2
inboundDigestMethod = SHA1;SHA256;SHA384;SHA512
inboundSignatureAlgorithm = RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
issuerId = idp:fqdn.com:saml2
lockRoleToFullDN = True
redirectAfterLogoutToUrl = https://www.splunk.com
redirectPort = 443
replicateCertificates = True
signAuthnRequest = True
signatureAlgorithm = RSA-SHA1
signedAssertion = True
sloBinding = HTTP-POST
ssoBinding = HTTP-POST
clientCert = /mnt/certs/saml_sig.pem
idpCertPath = /mnt/certs/
entityId
fqdn
idpSSOUrl
inboundDigestMethod
inboundSignatureAlgorithm
issuerId
lockRoleToFullDN
redirectAfterLogoutToUrl
redirectPort
replicateCertificates
signAuthnRequest
signatureAlgorithm
signedAssertion
sloBinding
ssoBinding
clientCert
idpCertPath
 
[roleMap_SAML]
admin = ldap-group-a
cloudgateway = ldap-group-b
dashboard = ldap-group-c
ess_admin = ldap-group-d
ess_analyst = ldap-group-e;ldap-group-f;ldap-group-g
...
splunk_soc_l1_l2 = ldap-group-y
splunk_soc_l3 = ldap-group-x
admin
cloudgateway
dashboard
ess_admin
ess_analyst
...
splunk_soc_l1_l2
splunk_soc_l3

So, the list of keys was duplicated without values.

Here is a configmap:

[yn@ip-10-224-31-36 /]$ kubectl get configmap splunk-prod-indexer-defaults -o yaml
apiVersion: v1
data:
  default.yml: |-
    splunk:
      site: site1
      multisite_master: localhost
      all_sites: site1,site2,site3,site4,site5,site6
      multisite_replication_factor_origin: 1
      multisite_replication_factor_total: 3
      multisite_search_factor_origin: 1
      multisite_search_factor_total: 3
      idxc:
        # search_factor: 3
        # replication_factor: 3
        app_paths_install:
          default:
            - https://path.to.app/config-explorer_1715.tgz
        apps_location:
          - https://path.to.app/config-explorer_1715.tgz
      app_paths:
        idxc: "/opt/splunk/etc/manager-apps"
      app_paths_install:
        default:
          - https://path.to.app/config-explorer_1715.tgz
        idxc:
          - https://path.to.app/cmp_indexer_indexes.tgz
          - https://path.to.app/cmp_resmonitor.tgz
          - https://path.to.app/cmp_soar_indexes.tgz
      conf:
        - key: server
          value:
            directory: /opt/splunk/etc/system/local
            content:
              imds:
                imds_version: v2
        - key: deploymentclient
          value:
            directory: /opt/splunk/etc/system/local
            content:
              deployment-client :
                disabled : false
              target-broker:deploymentServer :
                targetUri : ds.shared.cmp-a.internal.cmpgroup.cloud:8089
        - key: web
          value:
            directory: /opt/splunk/etc/system/local
            content:
              settings:
                enableSplunkWebSSL: true
        - key: authentication
          value:
            directory: /opt/splunk/etc/system/local
            content:
              authentication:
                authSettings : saml
                authType : SAML
              saml:
                entityId : splunkACSEntityId
                fqdn : https://cm.fqdn.cloud
                idpSSOUrl : https://idp.fqdn.com/idp/SSO.saml2
                inboundDigestMethod : SHA1;SHA256;SHA384;SHA512
                inboundSignatureAlgorithm : RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
                issuerId : idp:fqdn.com:saml2
                lockRoleToFullDN : true
                redirectAfterLogoutToUrl : https://www.splunk.com
                redirectPort : 443
                replicateCertificates : true
                signAuthnRequest : true
                signatureAlgorithm : RSA-SHA1
                signedAssertion : true
                sloBinding : HTTP-POST
                ssoBinding : HTTP-POST
                clientCert : /mnt/certs/saml_sig.pem
                idpCertPath: /mnt/certs/
              roleMap_SAML:
                admin : ldap-group-a
                cloudgateway : ldap-group-b
                dashboard : ldap-group-c
                ess_admin : ldap-group-d
                ess_analyst : ldap-group-e;ldap-group-f;ldap-group-g
                ...
                splunk_soc_l1_l2 : ldap-group-y
                splunk_soc_l3 : ldap-group-x
        - key: authorize
          value:
            directory: /opt/splunk/etc/system/local
            content:
              role_admin:
                run_script_adhocremotesearchraw : enabled
                run_script_adhocremotesearch : enabled
                run_script_environmentpoller : enabled
                run_script_sleepy : enabled
kind: ConfigMap
metadata:
  creationTimestamp: "2023-02-24T16:53:17Z"
  name: splunk-prod-indexer-defaults
  namespace: splunk-operator
  ownerReferences:
  - apiVersion: enterprise.splunk.com/v4
    controller: true
    kind: ClusterManager
    name: prod
    uid: 84aa7496-eb5a-4ffb-9549-c42f7780450e
  resourceVersion: "95698835"
  uid: 47b70fd9-0398-4aa0-ace5-20a5ac9d4842

Expected behavior
default.yml is rendered the same way on each run, without issues.

Splunk setup on K8S
EKS 1.27
Splunk Operator 2.3.0
Splunk 9.1.0.2

Reproduction/Testing steps
After an unexpected pod restart, the new pod started with a broken config.

The same thing happened in etc/system/local/server.conf:

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/server.conf | grep "\[imds\]" -A 3
[imds]
imds_version = v2
imds_version

and in etc/system/local/web.conf:

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/web.conf | grep "\[settings\]" -A 3
[settings]
mgmtHostPort = 0.0.0.0:8089
enableSplunkWebSSL = True
enableSplunkWebSSL

So, every file that was defined in the conf section is broken.
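
A quick way to check all of the files from the conf section at once is to look for bare key lines with no = sign. This is only a minimal sketch from inside the pod, assuming the file names shown above and the standard grep available in the container:

[splunk@splunk-prod-cluster-manager-0 splunk]$ for f in authentication server web; do
  # bare duplicated keys show up as whole lines with no "=" and no "[section]" header
  grep -nE '^[A-Za-z][A-Za-z0-9_]*$' etc/system/local/$f.conf
done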

kubectl delete pod initiates recreation of the pod, and then all seems fine (see the sketch below).
But we want to find the root cause, as this can happen anywhere!
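
For reference, the workaround is just the following (a minimal sketch; it assumes the cluster manager pod runs in the same splunk-operator namespace as the ConfigMap above):

[yn@ip-10-224-31-36 /]$ kubectl -n splunk-operator delete pod splunk-prod-cluster-manager-0

The StatefulSet then recreates the pod and the rendered conf files look correct again.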

unmasked diag uploaded in case #3285863

I found how to replicate the issue: delete/stop/kill the splunk process in the pod; after some time the liveness probe will trigger a restart of the pod, and after that you'll see the broken config.
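
A hedged sketch of that reproduction path from outside the pod (pod name and namespace assumed to match the ones above):

[yn@ip-10-224-31-36 /]$ kubectl -n splunk-operator exec -it splunk-prod-cluster-manager-0 -- /opt/splunk/bin/splunk stop
# wait until the liveness probe fails and the kubelet restarts the container, then re-check the rendered file
[yn@ip-10-224-31-36 /]$ kubectl -n splunk-operator exec -it splunk-prod-cluster-manager-0 -- cat /opt/splunk/etc/system/local/authentication.conf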

@iaroslav-nakonechnikov we are looking into this issue now, will update you with our findings.

The issue still exists in 9.1.1.

@yaroslav-nakonechnikov, we are working with the splunk-ansible team to fix this issue. We will update you once that is done.

Was it fixed?

Hi @yaroslav-nakonechnikov, this fix didn't go into 9.1.1. It's planned for 9.1.2. We will update you once the release is complete.

@vivekr-splunk 9.1.2 was released, but there is still no news here.
Is there any ETA?

Hello @yaroslav-nakonechnikov, this is fixed in the 9.1.2 build.

I managed to test it, and yes, it looks like this is fixed.
But see #1260.