druid-io/druid-operator

Druid pods take longer to come up when security is enabled

bharathkuppala opened this issue · 14 comments

Hello All,
druid-operator: v0.0.7
druid: v0.22.1

We are trying to enable basic security for Druid, and we found that it takes approx. 16 minutes for the entire cluster to come up. On investigating the logs, we found multiple retries against the Coordinator while trying to fetch user data.
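For reference, here is roughly what we enable in the Druid CR (a minimal sketch; the authenticator/authorizer names and passwords below are placeholders, and the property names come from Druid's druid-basic-security extension):

spec:
  common.runtime.properties: |
    # load the extension and wire up a basic authenticator/authorizer (placeholder names)
    druid.extensions.loadList=["druid-basic-security"]
    druid.auth.authenticatorChain=["BasicAuth"]
    druid.auth.authenticator.BasicAuth.type=basic
    druid.auth.authenticator.BasicAuth.initialAdminPassword=<redacted>
    druid.auth.authenticator.BasicAuth.initialInternalClientPassword=<redacted>
    druid.auth.authenticator.BasicAuth.authorizerName=BasicAuthz
    druid.auth.authorizers=["BasicAuthz"]
    druid.auth.authorizer.BasicAuthz.type=basic
    # escalated internal client used for service-to-service calls
    druid.escalator.type=basic
    druid.escalator.internalClientUsername=druid_system
    druid.escalator.internalClientPassword=<redacted>
    druid.escalator.authorizerName=BasicAuthz

With this enabled, every non-Coordinator process polls the Coordinator for the user/role maps at startup, which is where the retries below come from.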

Logs from middleManager:
{"timeMillis":1653992838211,"thread":"main","level":"WARN","loggerName":"org.apache.druid.java.util.common.RetryUtils","message":"Retrying (1 of 9) in 727ms.","thrown":{"commonElementCount":0,"localizedMessage":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","message":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","name":"com.fasterxml.jackson.databind.exc.MismatchedInputException","extendedStackTrace":[{"class":"com.fasterxml.jackson.databind.exc.MismatchedInputException","method":"from","file":"MismatchedInputException.java","line":59,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_initForReading","file":"ObjectMapper.java","line":4360,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_readMapAndClose","file":"ObjectMapper.java","line":4205,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"readValue","file":"ObjectMapper.java","line":3292,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"tryFetchUserMapsFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":400,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"lambda$fetchUserAndRoleMapFromCoordinator$4","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":330,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":129,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":81,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":163,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":153,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"fetchUserAndRoleMapFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":328,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"initUserMaps","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":457,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"start","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":116,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke0","file":"NativeMethodAccessorImpl.java","line":-2,"exact":fa
lse,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke","file":"NativeMethodAccessorImpl.java","line":62,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.DelegatingMethodAccessorImpl","method":"invoke","file":"DelegatingMethodAccessorImpl.java","line":43,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"java.lang.reflect.Method","method":"invoke","file":"Method.java","line":498,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler","method":"start","file":"Lifecycle.java","line":446,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle","method":"start","file":"Lifecycle.java","line":341,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.guice.LifecycleModule$2","method":"start","file":"LifecycleModule.java","line":143,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.GuiceRunnable","method":"initLifecycle","file":"GuiceRunnable.java","line":115,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.ServerRunnable","method":"run","file":"ServerRunnable.java","line":63,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.Main","method":"main","file":"Main.java","line":113,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}

In the above log, the CoordinatorPollingBasicAuthorizerCacheManager is fetching the user/role maps from the Coordinator, and it retries up to 9 times when the call fails.

We wanted to know: does druid-operator impose any ordering on a Druid installation, e.g. on a scratch install the Broker, Historicals, and MiddleManagers come up first, and only then the Coordinator and Router?

If the above is true, will the pods fail while looking for the Coordinator?
Any help on this would be appreciated.

@AdheipSingh: Can you please provide any info on this, if you already know?

Sorry, I missed this.
The druid-operator has nothing to do with whether security is enabled on Druid or not. The operator supports securityContexts on the Kubernetes pods it deploys and manages.
When rollingDeploy is set to true, the druid-operator upgrades the nodes in the order described here:
https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L1765
When the operator installs the cluster from scratch, all pods are deployed in parallel.
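For reference, a minimal sketch of where that flag lives in the Druid CR (the metadata name here is just a placeholder):

apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: tsdb-apache-druid
spec:
  # true: upgrades roll through the node groups one at a time, in the order linked above
  # false: changes are applied to all node groups in parallel
  rollingDeploy: true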

Thanks for the reply @AdheipSingh, and sorry for not responding sooner; we were testing this. We found that when we give all node types as StatefulSets, all pods are deployed in parallel like you said. But in our case we have:
Deployments: Broker, Coordinator, and Router
StatefulSets: Historicals and MiddleManagers

Now when we install, we see that along with the 2 StatefulSets only one Deployment comes up, which blocks the other Deployments from coming up.

Can you please check this behavior if you haven't already? Any suggestions would be a great help!

Stack trace
2022-09-08T05:40:35.675Z ERROR controller-runtime.manager.controller.druid Reconciler error {"reconciler group": "druid.apache.org", "reconciler kind": "Druid", "name": "tsdb-apache-druid", "namespace": "nom-backend", "error": "StatefulSet.apps \"druid-tsdb-apache-druid-brokers\" not found"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
After the above stack trace, it stops near the Broker, preventing the other Deployments from coming up.
Thanks!!

Intermittently I see this issue after doing a scratch install multiple times, but I always see the above stack trace; not sure why :(

@bharathkuppala https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L182

Whether Deployment or StatefulSet, in any combination, all pods should come up in parallel on the first installation.

Can you tell me the generation number in the CR metadata? If it is > 1 and rollingDeploy is set to true in your CR, upgrades and changes will happen incrementally.

Use kind: Deployment to switch a node group to a Deployment; by default it is a StatefulSet.
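To illustrate both points, a hedged sketch of the relevant CR fields (the node group and replica values are placeholders):

apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: tsdb-apache-druid
  generation: 1          # maintained by Kubernetes; becomes > 1 once the CR spec has been edited
spec:
  rollingDeploy: true
  nodes:
    brokers:
      nodeType: broker
      kind: Deployment   # overrides the default StatefulSet for this node group
      replicas: 2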

@AdheipSingh: Yes, like you said, there is no problem when doing an upgrade with generation > 1; it follows the prescribed order. But somehow, with podManagementPolicy set to Parallel, both the Historicals and MiddleManagers are deployed, then the Broker comes up, blocking the other Deployments.

We are running our Druid cluster on OpenShift. Just wanted to know: do we have to check something on the OpenShift side to troubleshoot this behavior?
Because in the operator we often saw it hitting this line and erroring down the else path:

if apierrors.IsNotFound(err) {

@bharathkuppala IMHO it should not arise; I will try to re-create this issue. Just to double-check: did you switch between StatefulSets and Deployments via kind?

@AdheipSingh: Yes, I did switch, but I still see this issue.

druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 0s
druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 0s
druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 0s
druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 1s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 2s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 2s
druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-historicals-0 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-middlemanagers-1 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 3s
druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-historicals-1 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-historicals-0 0/1 ContainerCreating 0 6s
druid-tsdb-apache-druid-middlemanagers-1 0/1 ContainerCreating 0 8s
druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 8s
druid-tsdb-apache-druid-historicals-1 0/1 ContainerCreating 0 9s
druid-tsdb-apache-druid-historicals-1 0/1 Running 0 1m
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Running 0 1m
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Running 0 1m
druid-tsdb-apache-druid-middlemanagers-1 0/1 Running 0 1m
druid-tsdb-apache-druid-historicals-0 0/1 Running 0 1m
druid-tsdb-apache-druid-middlemanagers-0 0/1 Running 0 2m
druid-tsdb-apache-druid-historicals-0 1/1 Running 0 13m
druid-tsdb-apache-druid-middlemanagers-1 1/1 Running 0 13m
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 1/1 Running 0 14m
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 1/1 Running 0 14m
druid-tsdb-apache-druid-middlemanagers-0 1/1 Running 0 14m
druid-tsdb-apache-druid-historicals-1 1/1 Running 0 14m

druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 0s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 1s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 1s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Running 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Running 0 4s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 1/1 Running 0 30s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 1/1 Running 0 30s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 ContainerCreating 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 ContainerCreating 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Running 0 3s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Running 0 4s

This is what happens in my current setup: only once the Historical, MiddleManager, and Broker pods are up are the Coordinators and Routers triggered.

Are you adding any node selectors? Are your Coordinators/Routers getting scheduled on different nodes?
If rollingDeploy is set to false, all pods should come up in parallel; there is no conditional check that waits for Druid nodes on the first installation.

@AdheipSingh: No, we are not using any node selectors for scheduling the pods. And we are using rollingDeploy: true because we want our pods to be upgraded in the prescribed order.

Yeah, I see that the druid-operator code does not check the generation on the first install, so pods should come up in parallel. I can share the druid-operator logs so we can have a complete trace.

@bharathkuppala What's your podManagementPolicy? Try setting it to Parallel.
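For example, a sketch of setting it explicitly on a node group (placement per the operator's node spec; values are placeholders). Note that plain Kubernetes StatefulSets default to OrderedReady, so setting it explicitly removes any ambiguity:

spec:
  nodes:
    historicals:
      nodeType: historical
      podManagementPolicy: Parallel   # start all replicas at once instead of one by one
      replicas: 2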

@AdheipSingh: Yes, it is Parallel by default for StatefulSets, right?