boundary/ordasity

Race in Cluster.onConnect that leaves cluster in shutdown state indefinitely?

jdanbrown opened this issue · 3 comments

I don't have hard evidence for this, but I think the following is possible:

And somewhere in there an Expired event for sessionid "a" shuts down the cluster (I'm not sure if it matters when), leaving the cluster in a shutdown state indefinitely because onConnect() for sessionid "b" was tricked into skipping cluster setup.

I can now trip this somewhat reliably in my own code (unpublished) by starting a handful of nodes (~5-10) and putting them through a short series of brief zk connection losses.

I think the root problem is that the code sometimes uses separate zk.get() calls, multiple times in sequence, while assuming that all interactions are in the same zk session, which is unsound because each zk.get() call can potentially start a new zk session.

Here's one simple example of where this problem can occur:

In the case where the two separate zk.get() calls return ZooKeeper instances for different sessions, we end up with a /nodes/<nodeId> actually created by session B (visible by stat) that thinks it was created by session A (visible by get). I think this particular race wouldn't cause the failure outlined above, but there's a zk.get() race at the core of that one too.

As for a fix, I'm imaging something along these lines:

  1. Indirect all existing calls to zk.get() through a call to currentZk()
  2. currentZk() calls zk.get() except when there's a fixed ZooKeeper instance to tie us to a single session, which we can unobtrusively communicate through a dynamic variable fixedZk:
def currentZk(): ZooKeeper = fixedZk.value getOrElse zk.get()
val fixedZk = new DynamicVariable[Option[ZooKeeper]](None) // No fixed zk session by default
  1. To fix a zk session for a particular block of code, wrap the block with zkSingleSession { ... }, which we can define as:
def zkSingleSession[X](x: => X): X = fixedZk.withValue(zk.get()) { x }

Then any zkSingleSession { body } is guaranteed to observe only one ZooKeeper instance, and thus only one session, and if it ever tries to use a ZooKeeper whose session has expired, it will trigger a SessionExpiredException that body probably can't meaningfully handle, in which case it will abort body, and the caller of zkSingleSession will be responsible for deciding whether to fail or retry.

This bug seems to amplify a ginormous design flaw in Twitter's Zookeeper API, that is, creating a method that do two things (retrieve or create and retrieve) instead of one. I proposed a fix some time ago, but job commitments got in my way. Your bug pitch shows this question should be addressed as soon as possible! Maybe we can start at ZKClient and move up the stack?