apache/helix

Ideal state updated even though `ZkBaseDataAccessor.set` call returned a non-successful error code

saurabhd336 opened this issue · 2 comments

Running into an issue with https://github.com/apache/pinot on Helix version 1.3.1, where there appears to be an inconsistency between the status code returned by `ZkBaseDataAccessor.set` and the actual outcome of the call. The `set` call that updates the ideal state fails with retCode != OK, yet the IdealState ZNode does get updated successfully. This causes unexpected behaviour further downstream in the system, which relies on the return code of the `set` call.
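The window for this inconsistency is easy to reproduce in isolation: ZooKeeper can apply the write server-side while the client thread is interrupted before it receives the response. Below is a minimal, dependency-free Java sketch of that race; the names are illustrative stand-ins, not Helix's or ZooKeeper's actual code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Minimal simulation (no ZooKeeper involved) of the suspected race: the
// "server" applies the write first, then the caller waiting for the ack is
// interrupted before it sees the response, so it reports failure even
// though the data changed.
public class InterruptedWriteDemo {
    static final AtomicReference<String> znodeData = new AtomicReference<>("old");

    // Stand-in for the write path behind ZkBaseDataAccessor.set: returns
    // false when interrupted while waiting for the server's response.
    static boolean set(String value, CountDownLatch applied) {
        znodeData.set(value);          // server-side mutation already happened
        applied.countDown();
        try {
            Thread.sleep(60_000);      // "waiting for ack" -- interrupted below
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;              // caller is told the set failed
        }
    }

    /** Returns the status the caller saw; znodeData holds the real outcome. */
    static boolean runDemo() throws InterruptedException {
        CountDownLatch applied = new CountDownLatch(1);
        AtomicBoolean result = new AtomicBoolean(true);
        Thread caller = new Thread(() -> result.set(set("new", applied)));
        caller.start();
        applied.await();               // write is already applied "server-side"
        caller.interrupt();            // e.g. the manager shutting down
        caller.join();
        return result.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("set returned: " + runDemo());       // false
        System.out.println("znode data:   " + znodeData.get()); // "new"
    }
}
```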

Relevant logs from the Helix calls:

2023/10/15 21:13:08.573 ERROR [ZkBaseDataAccessor] [jersey-server-managed-async-executor-24] Exception while setting path: /pinot/IDEALSTATES/myTable_OFFLINE
org.apache.helix.zookeeper.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2002) 
        at org.apache.helix.zookeeper.zkclient.ZkClient.writeDataReturnStat(ZkClient.java:2220) 
        at org.apache.helix.zookeeper.zkclient.ZkClient.writeDataGetStat(ZkClient.java:2236) 
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.doSet(ZkBaseDataAccessor.java:365) 
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.set(ZkBaseDataAccessor.java:344) 
        at org.apache.pinot.common.utils.helix.HelixHelper$1.call(HelixHelper.java:146) 
        at org.apache.pinot.common.utils.helix.HelixHelper$1.call(HelixHelper.java:104) 
        at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50) 
        at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:104) 
        at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:227) 
        at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:2265) 
        at org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:333) 
        at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:93) 
        at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:387) 
        at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsMultiPart(PinotSegmentUploadDownloadRestletResource.java:550) 
        at jdk.internal.reflect.GeneratedMethodAccessor402.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) 
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134) 
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177) 
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159) 
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81) 
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) 
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.lambda$apply$0(ResourceMethodInvoker.java:390) 
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2$1.run(ServerRuntime.java:830)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2.run(ServerRuntime.java:825)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method) ~[?:?]
        at java.lang.Object.wait(Object.java:328) ~[?:?]
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1594) 
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1566) 
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2559) 
        at org.apache.helix.zookeeper.zkclient.ZkConnection.writeDataReturnStat(ZkConnection.java:211) 
        at org.apache.helix.zookeeper.zkclient.ZkClient.lambda$writeDataReturnStat$9(ZkClient.java:2221) 
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) 
        ... 37 more

Worth noting, the Helix ZK manager may have been shutting down at around the same time, which could explain why the wait() inside ClientCnxn was interrupted. Either way, the expectation is that the status returned by `ZkBaseDataAccessor.set` reflects the actual outcome of the operation.
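Until the return code is trustworthy, one possible caller-side mitigation is to re-read the node after an ambiguous failure (interrupt, timeout) before concluding the update did not land. This is only a sketch: the in-memory map below stands in for ZooKeeper, and the helper names are hypothetical, not Helix's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a read-back verification pattern after an ambiguous write failure.
public class VerifyOnFailure {
    static final Map<String, String> store = new ConcurrentHashMap<>();

    // Pretend the write landed server-side but the client reported failure,
    // i.e. the situation described in this issue.
    static boolean unreliableSet(String path, String value) {
        store.put(path, value);
        return false; // simulates ZkInterruptedException -> retCode != OK
    }

    /** Re-read after an apparent failure to learn the true outcome. */
    static boolean setWithVerification(String path, String value) {
        if (unreliableSet(path, value)) {
            return true;
        }
        // The error was ambiguous: check what is actually stored before
        // deciding the update failed.
        return value.equals(store.get(path));
    }

    public static void main(String[] args) {
        boolean ok = setWithVerification("/pinot/IDEALSTATES/myTable_OFFLINE", "v2");
        System.out.println("effective outcome: " + ok);
    }
}
```

In the real system the re-read would go through the accessor's read path, and comparing the ZNode version from the returned Stat would be more robust than comparing payloads.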

2023/10/15 21:12:57.666 WARN [ZKHelixManager] [prometheus-http-1-5] Instance pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000 is not leader of cluster pinot due to current session 201154a5bc81aaa does not match leader session 2000b1610f82230
2023/10/15 21:13:07.941 INFO [ZKHelixManager] [Thread-57] disconnect Controller_pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000(PARTICIPANT) from pinot
2023/10/15 21:13:08.006 INFO [ParticipantHealthReportTask] [Thread-57] Stop ParticipantHealthReportTimerTask
2023/10/15 21:13:08.006 INFO [HelixTaskExecutor] [Thread-57] Shutting down HelixTaskExecutor
2023/10/15 21:13:08.006 INFO [HelixTaskExecutor] [Thread-57] Reset HelixTaskExecutor
2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean ClusterStatus:cluster=pinot,messageQueue=Controller_pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000 has been un-registered.
2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=USER_DEFINE_MSG has been un-registered.
2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: USER_DEFINE_MSG, pool: java.util.concurrent.ThreadPoolExecutor@3a4d0ca2[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@3a4d0ca2[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=TASK_REPLY has been un-registered.
2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: TASK_REPLY, pool: java.util.concurrent.ThreadPoolExecutor@45fa7fb7[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@45fa7fb7[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=STATE_TRANSITION has been un-registered.
2023/10/15 21:13:08.009 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: STATE_TRANSITION, pool: java.util.concurrent.ThreadPoolExecutor@3c83e9ca[Running, pool size = 16, active threads = 0, queued tasks = 0, completed tasks = 16]
2023/10/15 21:13:08.009 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@3c83e9ca[Running, pool size = 16, active threads = 0, queued tasks = 0, completed tasks = 16]
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Resetting HelixStateMachineEngine
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_15
2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_23
2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_12
2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_1
2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_20
2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_18
2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_8
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_5
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Successfully reset HelixStateMachineEngine
2023/10/15 21:13:08.014 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=STATE_TRANSITION_CANCELLATION has been un-registered.
2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: STATE_TRANSITION_CANCELLATION, pool: java.util.concurrent.ThreadPoolExecutor@6c427386[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@6c427386[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Resetting HelixStateMachineEngine
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_15
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_23
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_12
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_1
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_20
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_18
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_8
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_5
2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout
2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Successfully reset HelixStateMachineEngine
2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57]

Any help identifying the root cause or remedial steps would be appreciated. Thanks.

Any update on this?

This happens when the client is disconnected and the in-flight call gets interrupted. It means another thread sharing the ZkClient connection called disconnect on ZooKeeper.
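The comment above can be modeled in a few lines: one thread blocks waiting for a response "packet" (as `ClientCnxn.submitRequest` does in the stack trace), while a second thread tearing down the shared client interrupts it mid-wait. At that point the waiter cannot know whether its request was applied. The names here are illustrative only, not ZooKeeper internals.

```java
import java.util.concurrent.CountDownLatch;

// Toy model of a shared-client disconnect interrupting an in-flight request.
public class SharedClientDisconnect {
    static String runDemo() throws InterruptedException {
        CountDownLatch reply = new CountDownLatch(1);   // reply never arrives
        StringBuilder outcome = new StringBuilder();
        Thread requester = new Thread(() -> {
            try {
                reply.await();                 // waiting for the server's reply
                outcome.append("replied");
            } catch (InterruptedException e) {
                outcome.append("interrupted"); // request outcome is unknown here
            }
        });
        requester.start();
        requester.interrupt();                 // "disconnect" from another thread
        requester.join();
        return outcome.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo()); // prints "interrupted"
    }
}
```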