Splunk Operator: something breaking local config files on pod restart
Closed this issue · 13 comments
Please select the type of request
Bug
Tell us more
Describe the request
From time to time we see strange behavior: config files that were pushed through default.yml are broken after a pod restart.
[splunk@splunk-prod-cluster-manager-0 splunk]$ cat /opt/splunk/etc/system/local/authentication.conf
[authentication]
authSettings = saml
authType = SAML
authSettings
authType
[saml]
entityId = splunkACSEntityId
fqdn = https://cm.fqdn.cloud
idpSSOUrl = https://idp.fqdn.com/idp/SSO.saml2
inboundDigestMethod = SHA1;SHA256;SHA384;SHA512
inboundSignatureAlgorithm = RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
issuerId = idp:fqdn.com:saml2
lockRoleToFullDN = True
redirectAfterLogoutToUrl = https://www.splunk.com
redirectPort = 443
replicateCertificates = True
signAuthnRequest = True
signatureAlgorithm = RSA-SHA1
signedAssertion = True
sloBinding = HTTP-POST
ssoBinding = HTTP-POST
clientCert = /mnt/certs/saml_sig.pem
idpCertPath = /mnt/certs/
entityId
fqdn
idpSSOUrl
inboundDigestMethod
inboundSignatureAlgorithm
issuerId
lockRoleToFullDN
redirectAfterLogoutToUrl
redirectPort
replicateCertificates
signAuthnRequest
signatureAlgorithm
signedAssertion
sloBinding
ssoBinding
clientCert
idpCertPath
[roleMap_SAML]
admin = ldap-group-a
cloudgateway = ldap-group-b
dashboard = ldap-group-c
ess_admin = ldap-group-d
ess_analyst = ldap-group-e;ldap-group-f;ldap-group-g
...
splunk_soc_l1_l2 = ldap-group-y
splunk_soc_l3 = ldap-group-x
admin
cloudgateway
dashboard
ess_admin
ess_analyst
...
splunk_soc_l1_l2
splunk_soc_l3
So the list of keys was duplicated, without values.
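For reference, a quick way to spot the corrupted entries (a hedged one-liner against the same file as above; it just prints non-blank lines that are neither a [stanza] header nor a key = value pair):

grep -nvE '=|^\[|^[[:space:]]*$' /opt/splunk/etc/system/local/authentication.conf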
Here is the ConfigMap:
[yn@ip-10-224-31-36 /]$ kubectl get configmap splunk-prod-indexer-defaults -o yaml
apiVersion: v1
data:
  default.yml: |-
    splunk:
      site: site1
      multisite_master: localhost
      all_sites: site1,site2,site3,site4,site5,site6
      multisite_replication_factor_origin: 1
      multisite_replication_factor_total: 3
      multisite_search_factor_origin: 1
      multisite_search_factor_total: 3
      idxc:
        # search_factor: 3
        # replication_factor: 3
      app_paths_install:
        default:
          - https://path.to.app/config-explorer_1715.tgz
      apps_location:
        - https://path.to.app/config-explorer_1715.tgz
      app_paths:
        idxc: "/opt/splunk/etc/manager-apps"
      app_paths_install:
        default:
          - https://path.to.app/config-explorer_1715.tgz
        idxc:
          - https://path.to.app/cmp_indexer_indexes.tgz
          - https://path.to.app/cmp_resmonitor.tgz
          - https://path.to.app/cmp_soar_indexes.tgz
      conf:
        - key: server
          value:
            directory: /opt/splunk/etc/system/local
            content:
              imds:
                imds_version: v2
        - key: deploymentclient
          value:
            directory: /opt/splunk/etc/system/local
            content:
              deployment-client :
                disabled : false
              target-broker:deploymentServer :
                targetUri : ds.shared.cmp-a.internal.cmpgroup.cloud:8089
        - key: web
          value:
            directory: /opt/splunk/etc/system/local
            content:
              settings:
                enableSplunkWebSSL: true
        - key: authentication
          value:
            directory: /opt/splunk/etc/system/local
            content:
              authentication:
                authSettings : saml
                authType : SAML
              saml:
                entityId : splunkACSEntityId
                fqdn : https://cm.fqdn.cloud
                idpSSOUrl : https://idp.fqdn.com/idp/SSO.saml2
                inboundDigestMethod : SHA1;SHA256;SHA384;SHA512
                inboundSignatureAlgorithm : RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
                issuerId : idp:fqdn.com:saml2
                lockRoleToFullDN : true
                redirectAfterLogoutToUrl : https://www.splunk.com
                redirectPort : 443
                replicateCertificates : true
                signAuthnRequest : true
                signatureAlgorithm : RSA-SHA1
                signedAssertion : true
                sloBinding : HTTP-POST
                ssoBinding : HTTP-POST
                clientCert : /mnt/certs/saml_sig.pem
                idpCertPath: /mnt/certs/
              roleMap_SAML:
                admin : ldap-group-a
                cloudgateway : ldap-group-b
                dashboard : ldap-group-c
                ess_admin : ldap-group-d
                ess_analyst : ldap-group-e;ldap-group-f;ldap-group-g
                ...
                splunk_soc_l1_l2 : ldap-group-y
                splunk_soc_l3 : ldap-group-x
        - key: authorize
          value:
            directory: /opt/splunk/etc/system/local
            content:
              role_admin:
                run_script_adhocremotesearchraw : enabled
                run_script_adhocremotesearch : enabled
                run_script_environmentpoller : enabled
                run_script_sleepy : enabled
kind: ConfigMap
metadata:
  creationTimestamp: "2023-02-24T16:53:17Z"
  name: splunk-prod-indexer-defaults
  namespace: splunk-operator
  ownerReferences:
  - apiVersion: enterprise.splunk.com/v4
    controller: true
    kind: ClusterManager
    name: prod
    uid: 84aa7496-eb5a-4ffb-9549-c42f7780450e
  resourceVersion: "95698835"
  uid: 47b70fd9-0398-4aa0-ace5-20a5ac9d4842
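To confirm that the ConfigMap itself stays stable between restarts, the raw default.yml can be extracted and diffed across runs (a hedged example; the local file names are only illustrative, and note the escaped dot in the jsonpath key):

kubectl get configmap splunk-prod-indexer-defaults -n splunk-operator -o jsonpath='{.data.default\.yml}' > default-before.yml
# ...after the pod restart...
kubectl get configmap splunk-prod-indexer-defaults -n splunk-operator -o jsonpath='{.data.default\.yml}' > default-after.yml
diff default-before.yml default-after.yml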
Expected behavior
default.yml is rendered the same way on each run, without issues.
Splunk setup on K8S
EKS 1.27
Splunk Operator 2.3.0
Splunk 9.1.0.2
Reproduction/Testing steps
After some unpredicted restart of the pod, the new pod started with a broken config.
The same thing happened in etc/system/local/server.conf:
[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/server.conf | grep "\[imds\]" -A 3
[imds]
imds_version = v2
imds_version
and in etc/system/local/web.conf:
[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/web.conf | grep "\[settings\]" -A 3
[settings]
mgmtHostPort = 0.0.0.0:8089
enableSplunkWebSSL = True
enableSplunkWebSSL
So every file that was defined in the conf section is broken.
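To check every file rendered there at once, a similar hedged loop over /opt/splunk/etc/system/local (same idea as above: print lines that carry a key but no value):

for f in /opt/splunk/etc/system/local/*.conf; do
  # flag any non-blank line that is neither a [stanza] header nor a key = value pair
  grep -HnvE '=|^\[|^[[:space:]]*$' "$f"
done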
kubectl delete pod initiates recreation of the pod, and then all seems fine.
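Concretely (pod name taken from the session above, namespace from the ConfigMap metadata; adjust as needed):

kubectl delete pod splunk-prod-cluster-manager-0 -n splunk-operator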
But we want to find the root cause, as this can happen anywhere!
An unmasked diag was uploaded in case #3285863.
I found how I can replicate the issue: kill or stop the splunk process in the pod; after some time the liveness probe will trigger a restart of the pod, and after that you'll see the broken config.
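A hedged sketch of that reproduction, assuming the pod name from the session above and the stock /opt/splunk/bin/splunk binary inside the container:

# stop splunkd inside the running container so the liveness probe starts failing
kubectl exec -n splunk-operator splunk-prod-cluster-manager-0 -- /opt/splunk/bin/splunk stop
# wait for the kubelet to restart the container, then re-check the rendered config
kubectl get pod -n splunk-operator splunk-prod-cluster-manager-0 -w
kubectl exec -n splunk-operator splunk-prod-cluster-manager-0 -- cat /opt/splunk/etc/system/local/authentication.conf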
Reported: splunk/splunk-ansible#751
@iaroslav-nakonechnikov we are looking into this issue now; will update you with our findings.
The issue still exists in 9.1.1.
@yaroslav-nakonechnikov, we are working with the splunk-ansible team to fix this issue. Will update you once that is done.
Was it fixed?
Hi @yaroslav-nakonechnikov, this fix didn't go into 9.1.1. It's planned for 9.1.2. Will update you once the release is complete.
@vivekr-splunk 9.1.2 has been released, but still no news here.
Is there any ETA?
Hello @yaroslav-nakonechnikov, this is fixed in the 9.1.2 build.
I managed to test it, and yes, it looks like this is fixed.
But #1260