splunk/splunk-operator

App Framework: Add option to use path style s3 URLs

paheath opened this issue · 28 comments

Please select the type of request

Enhancement

Tell us more

Describe the request
I am deploying the operator in an on-prem environment with a storage solution that only supports path-style S3 URLs. As far as I can tell, the operator defaults to virtual-hosted-style S3 URLs when downloading apps. I propose keeping the current behavior as the default and adding an option in the AppFramework spec to explicitly switch the S3 URLs to path style. I rebuilt the operator with S3ForcePathStyle: aws.Bool(true) added here, and the app framework worked as expected.

SmartStore offers a similar option to specify the URL version, and it defaults to path style. See remote.s3.url_version here.

Expected behavior
When the option is set in the AppFramework spec, the S3 client is forced to use path-style URLs when downloading apps.
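For readers unfamiliar with the two addressing styles, here is a minimal sketch of how the same object would be addressed in each. It uses only the Go standard library and is not the operator's or the AWS SDK's code; the function name `objectURL`, the endpoint `https://s3.example.internal`, and the bucket `apps` are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// objectURL is a hypothetical helper, not part of the operator or the AWS SDK.
// With pathStyle true the bucket becomes the first path segment; otherwise
// the bucket is prepended to the endpoint host (virtual-hosted style).
func objectURL(endpoint, bucket, key string, pathStyle bool) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("endpoint %q has no host (missing scheme?)", endpoint)
	}
	key = strings.TrimPrefix(key, "/")
	if pathStyle {
		u.Path = "/" + bucket + "/" + key
	} else {
		u.Host = bucket + "." + u.Host
		u.Path = "/" + key
	}
	return u.String(), nil
}

func main() {
	// Endpoint, bucket, and key are illustrative values only.
	for _, pathStyle := range []bool{false, true} {
		s, err := objectURL("https://s3.example.internal", "apps", "node/myapp.tgz", pathStyle)
		if err != nil {
			panic(err)
		}
		fmt.Printf("pathStyle=%v -> %s\n", pathStyle, s)
	}
	// pathStyle=false -> https://apps.s3.example.internal/node/myapp.tgz
	// pathStyle=true -> https://s3.example.internal/apps/node/myapp.tgz
}
```

Storage that only supports path style typically cannot resolve the `apps.s3.example.internal` form, which is why the virtual-hosted default fails in that kind of environment.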

Splunk setup on K8S
SearchHeadCluster, IndexerCluster, ClusterManager, LicenseManager, MonitoringConsole, and Standalone heavy forwarder.

Reproduction/Testing steps
Enable path style s3 URLs via the AppFramework spec. Verify that apps are correctly downloaded and installed.

K8s environment
On-prem k8s cluster with on-prem s3-compatible NAS.

I guess this is related: #1030 (comment)

Hello @yaroslav-nakonechnikov @paheath we will work on this change and get back to you

Hello @paheath, we are exploring possible solutions for path-style S3 URLs. Meanwhile, can you please provide an example of the appFramework configuration that works for path-style URLs with the modified Splunk operator image?

Also, path style URLs will be discontinued per AWS documentation.

Currently, Amazon S3 supports both virtual-hosted–style and path-style URL access in all AWS Regions. However, path-style URLs will be discontinued in the future. For more information, see the following Important note.

This is an excerpt from my helm chart, and the underlying operator image is modified as indicated in the original bug description. I don't think any of the value substitutions necessarily impact the functionality. I've defined it in the yaml as documented here https://splunk.github.io/splunk-operator/AppFramework.html

appRepo:
  appsRepoPollIntervalSeconds: {{ .Values.configPollInterval }}
  defaults:
    volumeName: {{ .Values.volumeName }}
  appSources:
  - name: node
    location: node/
    scope: local
  volumes:
  - name: {{ .Values.volumeName }}
    storageType: s3
    path: {{ .Values.bucketPath }}/
    provider: aws
    region: {{ .Values.bucketRegion }}
    endpoint: {{ .Values.bucketEndpoint }}
    secretRef: {{ .Values.secretRef }}

Hi @paheath, thanks for the example above. To test our solution further, can you let us know which storage provider you are using to test path-style S3 URLs? Currently, AWS S3 buckets support both path-style and virtual-hosted-style access by default. I was able to test path style specifically on S3 buckets.

I'm testing against an on-prem s3-compatible NAS. I think testing against any s3-compatible storage might be sufficient, as long as you can confirm the outbound request is hitting the path-style endpoint when configured to do so. Maybe even locally block outbound traffic to the virtual endpoint. Testing might be similar to how the smartstore path-style config is tested.
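One way to confirm which style an outbound request used is to look at its Host header and path, for example from a proxy log or a packet capture. The sketch below is only an illustration of that check; the function `addressingStyle` and the host and bucket values are hypothetical and are not taken from the operator or its tests.

```go
package main

import (
	"fmt"
	"strings"
)

// addressingStyle is a hypothetical helper for inspecting a captured request.
// It returns "virtual" when the Host is <bucket>.<endpoint>, "path" when the
// Host is the bare endpoint and the path starts with /<bucket>, and
// "unknown" otherwise.
func addressingStyle(reqHost, reqPath, endpointHost, bucket string) string {
	host := strings.ToLower(reqHost)
	endpoint := strings.ToLower(endpointHost)
	switch {
	case host == strings.ToLower(bucket)+"."+endpoint:
		return "virtual"
	case host == endpoint &&
		(reqPath == "/"+bucket || strings.HasPrefix(reqPath, "/"+bucket+"/")):
		return "path"
	default:
		return "unknown"
	}
}

func main() {
	// Illustrative values only.
	fmt.Println(addressingStyle("s3.example.internal", "/apps/node/myapp.tgz", "s3.example.internal", "apps"))
	fmt.Println(addressingStyle("apps.s3.example.internal", "/node/myapp.tgz", "s3.example.internal", "apps"))
}
```

With the path-style option enabled, every captured download request would be expected to classify as "path".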

@paheath Are you able to test the changes in the MR to see if it's working before we merge? If something is missing, please comment on the MR or here and it will be fixed.

@paheath Please let us know if this solution works so we can merge it.

Unfortunately I can't get this change to work. I'm seeing my clustermanager instance reporting Ready, but all the apps in the description report this:

        appDeploymentInfo:
        - appName: myapp.tgz
          appPackageTopFolder: ""
          deployStatus: 1
          isUpdate: false
          objectHash: <hash>
          phaseInfo:
            failCount: 3
            phase: download
            status: 199
          repoState: 1

and the associated indexer cluster never reconciles. I don't see the apps appear in the pod under /opt/splunk/etc/apps or /opt/splunk/etc/manager-apps

Hey @paheath, can you share any Splunk Operator pod logs that show errors?

The CR status code 199 indicates that the app package was not downloaded properly.
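As a rough sketch of how one might pick out the affected apps from a CR status, the snippet below filters entries whose download phase reports status 199. It assumes the appDeploymentInfo list has been exported as JSON with the field names shown in the status output earlier in the thread. The Go type names, the function `failedDownloads`, and the second sample entry (`other.tgz` with status 0) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// downloadErrorStatus is the value described in this thread as meaning
// the app package was not downloaded properly.
const downloadErrorStatus = 199

// These types mirror only the fields visible in the status output above.
type phaseInfo struct {
	FailCount int    `json:"failCount"`
	Phase     string `json:"phase"`
	Status    int    `json:"status"`
}

type appDeployment struct {
	AppName   string    `json:"appName"`
	PhaseInfo phaseInfo `json:"phaseInfo"`
}

// failedDownloads returns the names of apps whose download phase reports
// downloadErrorStatus. raw is a JSON array shaped like appDeploymentInfo.
func failedDownloads(raw []byte) ([]string, error) {
	var apps []appDeployment
	if err := json.Unmarshal(raw, &apps); err != nil {
		return nil, err
	}
	var failed []string
	for _, a := range apps {
		if a.PhaseInfo.Phase == "download" && a.PhaseInfo.Status == downloadErrorStatus {
			failed = append(failed, a.AppName)
		}
	}
	return failed, nil
}

func main() {
	// Sample input; the second entry is invented for contrast.
	raw := []byte(`[
	  {"appName": "myapp.tgz", "phaseInfo": {"failCount": 3, "phase": "download", "status": 199}},
	  {"appName": "other.tgz", "phaseInfo": {"failCount": 0, "phase": "download", "status": 0}}
	]`)
	failed, err := failedDownloads(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(failed)
}
```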

Appears to be running through this periodically for the nodes using app framework:

2024-06-04T00:47:27.481032478Z  INFO    updatePplnWorkerPhaseInfo   changing the status {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "appName": "app.tgz", "old status": "Download In Progress", "new status": "Download Pending"}
2024-06-04T00:47:27.657331829Z  INFO    downloadPhaseManager    Download worker got a run slot  {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "name": "lm", "namespace": "test", "App name": "app.tgz", "digest": "<digest>"} 
2024-06-04T00:47:27.663811632Z  INFO    isAppAlreadyDownloaded  App not present on operator pod {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "app name": "app.tgz"}
2024-06-04T00:47:27.663872366Z  INFO    updatePplnWorkerPhaseInfo   changing the status {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "appName": "app.tgz", "old status": "Download Pending", "new status": "Download In Progress"}
2024-06-04T00:47:27.664103782Z  INFO    GetRemoteStorageClient  Creating the client {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "name": "lm", "namespace": "test", "volume": "config-repo", "bucket": "<bucket>", "bucket path": "lic_manager/"}
2024-06-04T00:47:27.664283386Z  INFO    InitAWSClientSession    AWS Client Session initialization successful.   {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "region": "zone1", "TLS Version": "TLS 1.2"}
2024-06-04T00:47:27.820996027Z  ERROR   DownloadApp Unable to download item {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "remoteFile": "lic_manager/app.tgz", "localFile": "/opt/splunk/appframework/downloadedApps/test/LicenseManager/lm/local/lic_manager/app.tgz_<etag>", "etag": "<etag>", "RemoteFile": "lic_manager/app.tgz", "error": "stream error: stream ID 7; NO_ERROR; received from peer"}
github.com/splunk/splunk-operator/pkg/splunk/client.(*AWSS3Client).DownloadApp
    /workspace/pkg/splunk/client/awss3client.go:277
github.com/splunk/splunk-operator/pkg/splunk/enterprise.(*RemoteDataClientManager).DownloadApp
    /workspace/pkg/splunk/enterprise/util.go:842
github.com/splunk/splunk-operator/pkg/splunk/enterprise.(*PipelineWorker).download
    /workspace/pkg/splunk/enterprise/afwscheduler.go:497
2024-06-04T00:47:27.821131931Z  ERROR   PipelineWorker.Download()   unable to download app  {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "name": "lm", "namespace": "test", "App name": "app.tgz", "objectHash": "<digest>", "appName": "app.tgz", "error": "stream error: stream ID 7; NO_ERROR; received from peer"}
github.com/splunk/splunk-operator/pkg/splunk/enterprise.(*PipelineWorker).download
    /workspace/pkg/splunk/enterprise/afwscheduler.go:499

This is the cluster manager app framework spec I'm using. Same as before with s3PathUrl: true set.

appRepo:
  appsRepoPollIntervalSeconds: {{ .Values.configPollInterval }}
  defaults:
    volumeName: {{ .Values.volumeName }}
  appSources:
  - name: node
    location: node/
    scope: local
  volumes:
  - name: {{ .Values.volumeName }}
    storageType: s3
    path: {{ .Values.bucketPath }}/
    provider: aws
    region: {{ .Values.bucketRegion }}
    s3PathUrl: true
    endpoint: {{ .Values.bucketEndpoint }}
    secretRef: {{ .Values.secretRef }}

Hey @paheath, while we debug further: were you able to successfully install the new CRDs on the new cluster before deploying the clusterManager CR? Please let us know.

Yes, I updated the CRDs beforehand. And the cluster manager accepted the s3PathUrl setting.

Well, maybe it did not take. In the cluster manager spec s3PathUrl is set to true. But when I describe the cluster manager, I see status.Smartstore.Volumes.s3PathUrl is false. Was s3PathUrl added for smartstore also?

Disregard, I see status.AppContext.AppRepo.AppSources.Volumes.s3PathUrl is set to true as expected. I didn't catch that the false setting was in the smartstore status section.

Thank you @paheath. I believe we are setting pathStyleUrl on the AWS S3 client, but as an update to an already-created client, just before the downloader is created, rather than during client creation as in your successful example here. Some posts online recommend against updating the client once it has been created. I will try to change this so the option is set during creation.
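To illustrate why the timing can matter in general, here is a small Go sketch in which a client copies its configuration when it is constructed, so a change made afterwards is not seen. The types `clientConfig` and `storageClient` are hypothetical stand-ins. This does not claim to describe how the AWS SDK or the operator actually behaves.

```go
package main

import "fmt"

// clientConfig and storageClient are hypothetical stand-ins,
// not AWS SDK or operator types.
type clientConfig struct {
	Endpoint       string
	ForcePathStyle bool
}

type storageClient struct {
	cfg clientConfig // copied by value when the client is constructed
}

// newStorageClient fixes the configuration at creation time.
func newStorageClient(cfg clientConfig) *storageClient {
	return &storageClient{cfg: cfg}
}

func (c *storageClient) usesPathStyle() bool { return c.cfg.ForcePathStyle }

func main() {
	cfg := clientConfig{Endpoint: "https://s3.example.internal"}

	late := newStorageClient(cfg)
	cfg.ForcePathStyle = true // changed after creation: late holds the old copy

	early := newStorageClient(cfg) // option already set at creation

	fmt.Println(late.usesPathStyle(), early.usesPathStyle()) // false true
}
```

Passing the option in at construction avoids depending on whether a given client picks up later changes.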

@paheath Are you able to try it out with the latest changes?

@paheath Please let us know if the latest changes are working.

Forgive me, my bandwidth is limited at the moment. I will do my best to get to this today.

With the latest patch I'm seeing the same "unable to download item" error logs as before. The general behavior is also the same, blocking indexer cluster creation.

Hi @paheath , thank you for testing. Are you able to provide us Splunk operator pod logs similar to this:

2024-06-06T01:03:17.019639356Z  INFO    InitAWSClientSession    Setting up AWS SDK client       {"controller": "standalone", "controllerGroup": "enterprise.splunk.com", "controllerKind": "Standalone", "Standalone": {"name":"example","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "ido", "reconcileID": "4c684039-fe1b-4bea-b550-ce618f2ef57e", "regionWithEndpoint": "us-west-2|https://s3-us-west-2.amazonaws.com", "pathStyleUrl": true}

The changes in the MR were made with this issue's description in mind; they can be found here.

I see similar logs for all nodes using the app framework (standalone, licensemanager, clustermanager)

2024-06-14T21:32:41.612345298Z  INFO    InitAWSClientSession    Setting up AWS SDK client       {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "<id>", "regionWithEndpoint": "zone1|https://<endpoint-fqdn>", "pathStyleUrl": true}
2024-06-14T21:32:41.61252801Z   INFO    InitAWSClientSession    AWS Client Session initialization successful.   {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "<id>", "region": "<region>", "TLS Version": "TLS 1.2"}

Hey @paheath , in the MR, the field S3ForcePathStyle of aws.Config is being set here per your original request. Were there any other changes made to make this work? If not, are you able to open a customer JIRA with Splunk Support so we can debug the issue further?

I've been able to test this a little more thoroughly today. I only had to add that one line to make this work previously, but I was testing on top of 2.4.0. I was able to reproduce that success on top of 2.4.0 today, but cherry-picking the one-line change on top of 2.5.2 did not work. Can you think of anything that has changed between 2.4.0 and 2.5.2 that would affect the behavior of the AWS S3 client? I compared the two releases, but I couldn't see anything obvious. I assume whatever is breaking this in 2.5.2 is also breaking your PR.

Hi @paheath, after comparing 2.4.0 and 2.5.2, I couldn't see any major differences that would cause the AWS SDK client to behave differently.

We just released 2.6.0. The MR has been rebased. Could you please try with the new version?

Hey @paheath , did you get a chance to try with 2.6.0? If it's not working can you please open a Splunk support case with these details?

Closing the issue for now. Please re-open a Splunk support ticket if the issue persists.