splunk/splunk-operator

Splunk Operator: something is breaking local config files on pod restart

Closed this issue · 13 comments

Please select the type of request

Bug

Tell us more

Describe the request
From time to time we see strange behavior: config files that were pushed through default.yml are broken after a pod restart.

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat /opt/splunk/etc/system/local/authentication.conf
 
[authentication]
authSettings = saml
authType = SAML
authSettings
authType
 
[saml]
entityId = splunkACSEntityId
fqdn = https://cm.fqdn.cloud
idpSSOUrl = https://idp.fqdn.com/idp/SSO.saml2
inboundDigestMethod = SHA1;SHA256;SHA384;SHA512
inboundSignatureAlgorithm = RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
issuerId = idp:fqdn.com:saml2
lockRoleToFullDN = True
redirectAfterLogoutToUrl = https://www.splunk.com
redirectPort = 443
replicateCertificates = True
signAuthnRequest = True
signatureAlgorithm = RSA-SHA1
signedAssertion = True
sloBinding = HTTP-POST
ssoBinding = HTTP-POST
clientCert = /mnt/certs/saml_sig.pem
idpCertPath = /mnt/certs/
entityId
fqdn
idpSSOUrl
inboundDigestMethod
inboundSignatureAlgorithm
issuerId
lockRoleToFullDN
redirectAfterLogoutToUrl
redirectPort
replicateCertificates
signAuthnRequest
signatureAlgorithm
signedAssertion
sloBinding
ssoBinding
clientCert
idpCertPath
 
[roleMap_SAML]
admin = ldap-group-a
cloudgateway = ldap-group-b
dashboard = ldap-group-c
ess_admin = ldap-group-d
ess_analyst = ldap-group-e;ldap-group-f;ldap-group-g
...
splunk_soc_l1_l2 = ldap-group-y
splunk_soc_l3 = ldap-group-x
admin
cloudgateway
dashboard
ess_admin
ess_analyst
...
splunk_soc_l1_l2
splunk_soc_l3

So, the list of keys was duplicated without values.

Here is a configmap:

[yn@ip-10-224-31-36 /]$ kubectl get configmap splunk-prod-indexer-defaults -o yaml
apiVersion: v1
data:
  default.yml: |-
    splunk:
      site: site1
      multisite_master: localhost
      all_sites: site1,site2,site3,site4,site5,site6
      multisite_replication_factor_origin: 1
      multisite_replication_factor_total: 3
      multisite_search_factor_origin: 1
      multisite_search_factor_total: 3
      idxc:
        # search_factor: 3
        # replication_factor: 3
        app_paths_install:
          default:
            - https://path.to.app/config-explorer_1715.tgz
        apps_location:
          - https://path.to.app/config-explorer_1715.tgz
      app_paths:
        idxc: "/opt/splunk/etc/manager-apps"
      app_paths_install:
        default:
          - https://path.to.app/config-explorer_1715.tgz
        idxc:
          - https://path.to.app/cmp_indexer_indexes.tgz
          - https://path.to.app/cmp_resmonitor.tgz
          - https://path.to.app/cmp_soar_indexes.tgz
      conf:
        - key: server
          value:
            directory: /opt/splunk/etc/system/local
            content:
              imds:
                imds_version: v2
        - key: deploymentclient
          value:
            directory: /opt/splunk/etc/system/local
            content:
              deployment-client :
                disabled : false
              target-broker:deploymentServer :
                targetUri : ds.shared.cmp-a.internal.cmpgroup.cloud:8089
        - key: web
          value:
            directory: /opt/splunk/etc/system/local
            content:
              settings:
                enableSplunkWebSSL: true
        - key: authentication
          value:
            directory: /opt/splunk/etc/system/local
            content:
              authentication:
                authSettings : saml
                authType : SAML
              saml:
                entityId : splunkACSEntityId
                fqdn : https://cm.fqdn.cloud
                idpSSOUrl : https://idp.fqdn.com/idp/SSO.saml2
                inboundDigestMethod : SHA1;SHA256;SHA384;SHA512
                inboundSignatureAlgorithm : RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
                issuerId : idp:fqdn.com:saml2
                lockRoleToFullDN : true
                redirectAfterLogoutToUrl : https://www.splunk.com
                redirectPort : 443
                replicateCertificates : true
                signAuthnRequest : true
                signatureAlgorithm : RSA-SHA1
                signedAssertion : true
                sloBinding : HTTP-POST
                ssoBinding : HTTP-POST
                clientCert : /mnt/certs/saml_sig.pem
                idpCertPath: /mnt/certs/
              roleMap_SAML:
                admin : ldap-group-a
                cloudgateway : ldap-group-b
                dashboard : ldap-group-c
                ess_admin : ldap-group-d
                ess_analyst : ldap-group-e;ldap-group-f;ldap-group-g
                ...
                splunk_soc_l1_l2 : ldap-group-y
                splunk_soc_l3 : ldap-group-x
        - key: authorize
          value:
            directory: /opt/splunk/etc/system/local
            content:
              role_admin:
                run_script_adhocremotesearchraw : enabled
                run_script_adhocremotesearch : enabled
                run_script_environmentpoller : enabled
                run_script_sleepy : enabled
kind: ConfigMap
metadata:
  creationTimestamp: "2023-02-24T16:53:17Z"
  name: splunk-prod-indexer-defaults
  namespace: splunk-operator
  ownerReferences:
  - apiVersion: enterprise.splunk.com/v4
    controller: true
    kind: ClusterManager
    name: prod
    uid: 84aa7496-eb5a-4ffb-9549-c42f7780450e
  resourceVersion: "95698835"
  uid: 47b70fd9-0398-4aa0-ace5-20a5ac9d4842

Expected behavior
default.yml is rendered the same way on each run, without issues.

Splunk setup on K8S
EKS 1.27
Splunk Operator 2.3.0
Splunk 9.1.0.2

Reproduction/Testing steps
After an unexpected pod restart, the new pod started with a broken config.

The same thing happened in etc/system/local/server.conf:

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/server.conf | grep "\[imds\]" -A 3
[imds]
imds_version = v2
imds_version

and in etc/system/local/web.conf:

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/web.conf | grep "\[settings\]" -A 3
[settings]
mgmtHostPort = 0.0.0.0:8089
enableSplunkWebSSL = True
enableSplunkWebSSL

So, every file that was defined in the conf section is broken.
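
A quick way to check all of the files from the conf section at once is to look for bare key lines with no = sign. This is only a minimal sketch from inside the pod, assuming the file names shown above and the standard grep available in the container:

[splunk@splunk-prod-cluster-manager-0 splunk]$ for f in authentication server web; do
  # bare duplicated keys show up as whole lines with no "=" and no "[section]" header
  grep -nE '^[A-Za-z][A-Za-z0-9_]*$' etc/system/local/$f.conf
done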

kubectl delete pod initiates recreation of the pod, and then all seems fine (see the sketch below).
But we want to find the root cause, as this can happen anywhere!
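
For reference, the workaround is just the following (a minimal sketch; it assumes the cluster manager pod runs in the same splunk-operator namespace as the ConfigMap above):

[yn@ip-10-224-31-36 /]$ kubectl -n splunk-operator delete pod splunk-prod-cluster-manager-0

The StatefulSet then recreates the pod and the rendered conf files look correct again.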

unmasked diag uploaded in case #3285863

I found how to replicate the issue: delete/stop/kill the splunk process in the pod; after some time the liveness probe will trigger a restart of the pod, and after that you'll see the broken config.
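
A hedged sketch of that reproduction path from outside the pod (pod name and namespace assumed to match the ones above):

[yn@ip-10-224-31-36 /]$ kubectl -n splunk-operator exec -it splunk-prod-cluster-manager-0 -- /opt/splunk/bin/splunk stop
# wait until the liveness probe fails and the kubelet restarts the container, then re-check the rendered file
[yn@ip-10-224-31-36 /]$ kubectl -n splunk-operator exec -it splunk-prod-cluster-manager-0 -- cat /opt/splunk/etc/system/local/authentication.conf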

@iaroslav-nakonechnikov we are looking into this issue now, will update you with our findings.

The issue still exists in 9.1.1.

@yaroslav-nakonechnikov, we are working with the splunk-ansible team to fix this issue. We will update you once that is done.

Was it fixed?

Hi @yaroslav-nakonechnikov, this fix didn't go into 9.1.1. It's planned for 9.1.2. We will update you once the release is complete.

@vivekr-splunk 9.1.2 was released, but there is still no news here.
Is there any ETA?

Hello @yaroslav-nakonechnikov, this is fixed in the 9.1.2 build.

I managed to test it, and yes, it looks like this is fixed.
But see #1260.