linkedin/venice

[BUG] Before deleting push status node in LeakedPushStatusCleanUpService, stop monitoring for it first

ZacAttack opened this issue · 1 comment

Willingness to contribute

Yes. I can contribute a fix for this bug independently.

Venice version

0.4.139

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.0): Mariner 5.15.111.1-1.cm2
  • JDK version: 17

Describe the problem

The current behavior of LeakedPushStatusCleanUpService is:

  1. If the resource is also leaked in Helix, the controller deletes the Helix resource first. This should trigger the state transitions that drop the resource on the server instances and eventually clean up the push status ZNode.
  2. Otherwise, it deletes the push status ZNode directly.

We should improve case (2): the controller should stop monitoring the push first and only then delete the push status ZNode, which avoids the excessive error logs.
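
The ordering proposed for case (2) could look roughly like the sketch below. It is only an illustration: `PushMonitor`, `PushStatusStore`, and `cleanUpLeakedPushStatus` are hypothetical stand-ins written for this issue, not Venice classes or methods, and the in-memory fakes exist only to make the call order visible.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types are real Venice classes.
final class LeakedPushStatusCleanupSketch {
  /** Stand-in for the controller-side push monitor. */
  interface PushMonitor {
    boolean isMonitoring(String topic);

    void stopMonitoring(String topic);
  }

  /** Stand-in for whatever removes the push status ZNode. */
  interface PushStatusStore {
    void delete(String topic);
  }

  /** Case (2): stop monitoring first, then delete the push status. */
  static void cleanUpLeakedPushStatus(String topic, PushMonitor monitor, PushStatusStore store) {
    if (monitor.isMonitoring(topic)) {
      monitor.stopMonitoring(topic);
    }
    store.delete(topic);
  }

  /** Runs the cleanup against in-memory fakes and returns the recorded call order. */
  static List<String> demo() {
    List<String> events = new ArrayList<>();
    Set<String> monitored = new HashSet<>();
    monitored.add("store_v1");

    PushMonitor monitor = new PushMonitor() {
      @Override
      public boolean isMonitoring(String topic) {
        return monitored.contains(topic);
      }

      @Override
      public void stopMonitoring(String topic) {
        monitored.remove(topic);
        events.add("stop:" + topic);
      }
    };
    PushStatusStore store = topic -> events.add("delete:" + topic);

    cleanUpLeakedPushStatus("store_v1", monitor, store);
    return events;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The `isMonitoring` guard is an assumption that keeps the cleanup safe to call for a push the controller is not tracking.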

Tracking information

2023/05/01 00:22:11.771 WARN [ZkClient] [Venice-Admin-Execution-Task-t1] [venice-controller-war] [] zkclient 3, Failed to delete path /venice-13/OfflinePushes/HB_VPJtarget_prod-venice-13_v10816! 
org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /venice-13/OfflinePushes/HB_VPJtarget_prod-venice-13_v10816
	at org.apache.helix.zookeeper.zkclient.exception.ZkException.create(ZkException.java:72) ~[helix-common-1.0.4.jar:?]
	at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2000) ~[helix-common-1.0.4.jar:?]
	at org.apache.helix.zookeeper.zkclient.ZkClient.delete(ZkClient.java:2058) [helix-common-1.0.4.jar:?]
	at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(ZkBaseDataAccessor.java:727) [helix-core-1.0.4.jar:1.0.4]
	at com.linkedin.venice.utils.HelixUtils.remove(HelixUtils.java:210) [venice-common-0.4.58.jar:?]
	at com.linkedin.venice.utils.HelixUtils.remove(HelixUtils.java:204) [venice-common-0.4.58.jar:?]
	at com.linkedin.venice.helix.VeniceOfflinePushMonitorAccessor.deleteOfflinePushStatusAndItsPartitionStatuses(VeniceOfflinePushMonitorAccessor.java:186) [venice-common-0.4.58.jar:?]
	at com.linkedin.venice.pushmonitor.AbstractPushMonitor.cleanupPushStatus(AbstractPushMonitor.java:538) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.pushmonitor.AbstractPushMonitor.stopMonitorOfflinePush(AbstractPushMonitor.java:251) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.pushmonitor.PushMonitorDelegator.stopMonitorOfflinePush(PushMonitorDelegator.java:117) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.VeniceHelixAdmin.stopMonitorOfflinePush(VeniceHelixAdmin.java:6282) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.VeniceHelixAdmin.deleteOneStoreVersion(VeniceHelixAdmin.java:2874) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.VeniceHelixAdmin.deleteOneStoreVersion(VeniceHelixAdmin.java:2851) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.VeniceHelixAdmin.retireOldStoreVersions(VeniceHelixAdmin.java:2975) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.VeniceHelixAdmin.addVersion(VeniceHelixAdmin.java:2406) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.VeniceHelixAdmin.addVersionAndStartIngestion(VeniceHelixAdmin.java:1703) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.kafka.consumer.AdminExecutionTask.handleAddVersion(AdminExecutionTask.java:626) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.kafka.consumer.AdminExecutionTask.processMessage(AdminExecutionTask.java:223) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.kafka.consumer.AdminExecutionTask.call(AdminExecutionTask.java:124) [venice-controller-0.4.58.jar:?]
	at com.linkedin.venice.controller.kafka.consumer.AdminExecutionTask.call(AdminExecutionTask.java:67) [venice-controller-0.4.58.jar:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /venice-13/OfflinePushes/HB_VPJtarget_prod-venice-13_v10816
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:132) ~[zookeeper-3.7.1.jar:3.7.1]
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.7.1.jar:3.7.1]
	at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1670) ~[zookeeper-3.7.1.jar:3.7.1]
	at org.apache.helix.zookeeper.zkclient.ZkConnection.delete(ZkConnection.java:144) ~[helix-common-1.0.4.jar:?]
	at org.apache.helix.zookeeper.zkclient.ZkClient$10.call(ZkClient.java:2062) ~[helix-common-1.0.4.jar:?]
	at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[helix-common-1.0.4.jar:?]
	... 22 more
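
For context on the exception above: ZooKeeper rejects a delete on a node that still has children, which is what `NotEmptyException` reports. The sketch below models that rule with a small in-memory tree so it can run without a ZooKeeper server. `ZNodeTreeSketch` and its methods are hypothetical and are not ZooKeeper or Helix APIs, and the paths used with it are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of a ZNode tree; not a ZooKeeper or Helix class.
final class ZNodeTreeSketch {
  private final Set<String> paths = new TreeSet<>();

  void create(String path) {
    paths.add(path);
  }

  boolean exists(String path) {
    return paths.contains(path);
  }

  /** Returns the direct children of the given path. */
  List<String> children(String path) {
    String prefix = path + "/";
    List<String> result = new ArrayList<>();
    for (String candidate : paths) {
      // A direct child starts with the prefix and has no further '/' after it.
      if (candidate.startsWith(prefix) && candidate.indexOf('/', prefix.length()) < 0) {
        result.add(candidate);
      }
    }
    return result;
  }

  /** Mimics ZooKeeper's rule: a node that still has children cannot be deleted. */
  void delete(String path) {
    if (!children(path).isEmpty()) {
      throw new IllegalStateException("Directory not empty for " + path);
    }
    paths.remove(path);
  }

  /** Deletes children depth-first, then the node itself. */
  void deleteRecursively(String path) {
    for (String child : children(path)) {
      deleteRecursively(child);
    }
    delete(path);
  }

  public static void main(String[] args) {
    ZNodeTreeSketch tree = new ZNodeTreeSketch();
    tree.create("/OfflinePushes/store_v1");
    tree.create("/OfflinePushes/store_v1/0");
    try {
      tree.delete("/OfflinePushes/store_v1");
    } catch (IllegalStateException e) {
      System.out.println("plain delete failed: " + e.getMessage());
    }
    tree.deleteRecursively("/OfflinePushes/store_v1");
    System.out.println("exists after recursive delete: " + tree.exists("/OfflinePushes/store_v1"));
  }
}
```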

Code to reproduce bug

No response

What component(s) does this bug affect?

  • Controller: This is the control-plane for Venice. Used to create/update/query stores and their metadata.
  • Router: This is the stateless query-routing layer for serving read requests.
  • Server: This is the component that persists all the store data.
  • VenicePushJob: This is the component that pushes derived data from Hadoop to Venice backend.
  • VenicePulsarSink: This is a Sink connector for Apache Pulsar that pushes data from Pulsar into Venice.
  • Thin Client: This is a stateless client users use to query Venice Router for reading store data.
  • Fast Client: This is a stateful client users use to query Venice Server for reading store data.
  • Da Vinci Client: This is an embedded, stateful client that materializes store data locally.
  • Alpini: This is the framework that fast-client and routers use to route requests to the storage nodes that have the data.
  • Samza: This is the library users use to make nearline updates to store data.
  • Admin Tool: This is the stand-alone client used for ad-hoc operations on Venice.
  • Scripts: These are the various ops scripts in the repo.

Gonna close this one. After talking with @nisargthakkar, it seems the Helix library does actually delete the ZK path, but it still logs this rather confusing message about failing to delete it. We've filed the bug against Helix instead, and we're closing this one for now since it doesn't seem to be something we can fully fix on our end.