splunk/splunk-operator

Splunk Operator: upgrade from 9.0.4.1 to 9.0.5 to 9.1.0.1

Opened this issue · 8 comments

Please select the type of request

Bug

Tell us more

Subject: Splunk upgrade from 9.0.5 to 9.1.0.1
Description: Following previous vulnerability reports for Splunk, we upgraded our infrastructure from version 9.0.4.1 to version 9.0.5, and we encountered the following issues once the upgrade to 9.0.5 was complete:

  1. Replication errors on port 9887: "07-26-2023 21:07:36.287 +0000 ERROR TcpOutputFd [6509 TcpOutEloop] - Connection to host=ipaddress:9887 failed"

  2. AWS credentials failures: "07-26-2023 21:07:21.883 +0000 ERROR AwsCredentials [821211 TcpChannelThread] - Failed to execute runSync() for transaction with uri=https://sts.amazonaws.com/ http_code=400"

  3. S3 404s on SmartStore object downloads (see the sketch after this list): "07-26-2023 16:10:11.044 +0000 ERROR S3Client [222205 cachemanagerDownloadExecutorWorker-3] - command=get transactionId=0x7f6181a95800 rTxnId=0x7f63be4fda00 status=completed success=N uri=https://-eks-prod-smart-store.s3-eu-central-1.amazonaws.com.index/db/c9/67/723~7B2FF365-6D7E-4F72-B43A-727BE109E2F8/guidSplunk-7B2FF365-6D7E-4F72-B43A-727BE109E2F8/SourceTypes.data statusCode=404".

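For error 3, one way to tell whether the 404 comes from the bucket itself (the object really is gone) or from the path/credentials Splunk is using is to head the same object key directly with boto3. This is only a sketch: the bucket name and key below are placeholders and need to be replaced with the real values from your remote volume configuration and the uri= field of the error.

```python
# Sketch: check whether the object from the S3Client 404 actually exists in the bucket.
# BUCKET and KEY are placeholders -- substitute the real bucket and the object key
# taken from the uri= field of the error message.
import boto3
from botocore.exceptions import ClientError

BUCKET = "<your-smart-store-bucket>"                      # placeholder
KEY = "<index>/db/<bucket-path>/SourceTypes.data"         # placeholder key from the log

s3 = boto3.client("s3", region_name="eu-central-1")
try:
    resp = s3.head_object(Bucket=BUCKET, Key=KEY)
    print("object exists, size:", resp["ContentLength"])
except ClientError as err:
    code = err.response["Error"]["Code"]
    if code in ("404", "NotFound", "NoSuchKey"):
        print("object is genuinely missing from the bucket")
    else:
        # e.g. 403 here would point at a credentials/permissions problem instead
        print("request failed with:", code)
```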
Given the instability of our platform and in order to resolve the problems mentioned above, we decided to move to version 9.1.0.1. Once upgraded to the latest version, we noticed that all of the errors and issues were resolved,
BUT the size of the historical S3 SmartStore data was reduced, and when we search a time period where we know the data is present, the results are incomplete: not all of the results we expect are being dispatched and shown on the search head.
@kashok-splunk could you please assist us with this issue?

Hi @logsecvuln, are you using AWS PrivateLink to connect your virtual private clouds? Can you access sts.amazonaws.com from the cluster?
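One quick way to verify this from inside (or with the same service account as) an indexer pod is a simple STS identity call; a minimal sketch, assuming boto3 is available where you run it:

```python
# Sketch: confirm that the pod's credentials can reach STS and return a caller identity.
# Run inside the Splunk indexer pod, or from a debug container using the same service account.
import boto3

sts = boto3.client("sts")  # optionally pass endpoint_url="https://sts.amazonaws.com/" to force the global endpoint
identity = sts.get_caller_identity()
print("Account:", identity["Account"])
print("Arn    :", identity["Arn"])
```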

@vivekr-splunk Yes, I can access it. That problem arose when we were on version 9.0.5. Now we have upgraded to version 9.1.0.1, and the problem is that our historical S3 size got smaller and the query results are incomplete. For example, with the previous version we had almost 15k notables from January up until now, but now the results are around 10k over the same time period, and some results are still shown for January, so it seems the results are only partially dispatched.

@logsecvuln, there are two different issues here:

  • The 9.0.4.1 to 9.0.5 upgrade didn't work, which is strange: you can access the global STS endpoint, but the Splunk code was not able to reach it, whereas after moving to 9.1.0.1 it can access the global STS endpoint.
  • The S3 size has been reduced after moving to 9.1.0.1.

I am trying to understand why STS endpoint access failed on 9.0.5. I will look into the S3 issues and get back to you.

@vivekr-splunk
Yes, it is indeed strange, as I didn't change anything in the AWS configuration, and since we upgraded to version 9.1.0.1 the STS issues and replication failures have not shown up.

And yes, unfortunately the bucket size count on the cluster manager gives smaller numbers than before.
S3 is still only showing the metrics graph for previous days, and I am also waiting to validate the real size of the buckets. Thanks, and looking forward to your reply.
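A hedged sketch of how the real bucket size could be cross-checked with boto3, independently of the delayed S3 metrics graph (the bucket name is a placeholder):

```python
# Sketch: sum the sizes of all objects in the SmartStore bucket to cross-check
# the cluster manager's bucket counts and the S3 storage metrics.
import boto3

BUCKET = "<your-smart-store-bucket>"  # placeholder

s3 = boto3.client("s3", region_name="eu-central-1")
total_bytes = 0
total_objects = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        total_objects += 1

print(f"{total_objects} objects, {total_bytes / (1024 ** 3):.1f} GiB")
```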

@vivekr-splunk I have also submitted a case to the Splunk support portal with the following number: [3276742].

Is there any update? I'm seeing the log message below on 9.0.5 as well.

Failed to execute runSync() for transaction with uri=https://sts.amazonaws.com http_code=400

But with the addition of messages like <Message>The provided token has expired.</Message>, which I find interesting since it is using pod service roles.

@vivekr-splunk I believe Splunk is not fetching new tokens as documented here: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html. Can you confirm?
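To check whether the projected web-identity token in the pod is actually being rotated before it expires, you can decode its exp claim; a minimal sketch, assuming the default EKS IRSA token mount path inside the pod:

```python
# Sketch: decode the exp claim of the projected service-account token used for IRSA,
# to see when it expires and whether it is being rotated as expected.
import base64
import json
import time

# Default mount path for EKS IAM-roles-for-service-accounts (IRSA) tokens.
TOKEN_PATH = "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"

with open(TOKEN_PATH) as f:
    token = f.read().strip()

payload_b64 = token.split(".")[1]
payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload_b64))

print("issued at         :", time.ctime(claims["iat"]))
print("expires at        :", time.ctime(claims["exp"]))
print("seconds remaining :", int(claims["exp"] - time.time()))
```

If the remaining lifetime keeps shrinking toward zero without the file being refreshed, that would be consistent with the "provided token has expired" messages; if the token is refreshed but Splunk still fails, the problem is more likely in how Splunk re-reads it.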

We are looking into this issue and will get back to you once we find the root cause.