Race in Cluster.onConnect that leaves cluster in shutdown state indefinitely?
jdanbrown opened this issue · 3 comments
I don't have hard evidence for this, but I think the following is possible:
- Receive `SyncConnected` for sessionid `"a"`
- Call `onConnect()`
  - zk session expires here; sessionid was `"a"`
  - Call `joinCluster()`
    - `zk.get().getSessionId` establishes a new session with sessionid `"b"` and returns `"b"`
    - `/<name>/nodes/<nodeID>` is created with connectionID `"b"`, lifetime bound to sessionid `"b"`
  - Complete `joinCluster()`, `onConnect()`, `SyncConnected`
- Receive `SyncConnected` for sessionid `"b"`
- Call `onConnect()`
  - Call `previousZKSessionStillActive()`
    - `previousZKSessionStillActive()` uses session `"b"` to read `/<name>/nodes/<nodeID>`, finds connectionID `"b"`, and returns `true`
  - `onConnect()` returns early, skipping cluster setup
And somewhere in there an `Expired` event for sessionid `"a"` shuts down the cluster (I'm not sure if it matters when), leaving the cluster in a shutdown state indefinitely, because `onConnect()` for sessionid `"b"` was tricked into skipping cluster setup.
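To make the race concrete, here is a minimal sketch of the shape that check could have (a guess, not the actual `Cluster` code; `zkGet`, the node-data format, and the comparison are assumptions of mine):

```scala
import org.apache.zookeeper.ZooKeeper

// Hypothetical sketch of the check that gets fooled. The names follow the
// description above; the real implementation may differ. `zkGet` stands in
// for zk.get(), and the connectionID is assumed to be stored as node data.
class SessionCheckSketch(zkGet: () => ZooKeeper, name: String, nodeID: String) {
  def previousZKSessionStillActive(): Boolean = {
    // Read /<name>/nodes/<nodeID> with the *current* session ("b")...
    val stored = new String(zkGet().getData(s"/$name/nodes/$nodeID", false, null), "UTF-8")
    // ...and compare it against the *current* session id, which is also "b",
    // so the check passes even though session "a" (the one we are supposedly
    // recovering from) is long gone.
    stored == zkGet().getSessionId.toString
  }
}
```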
I can now trip this somewhat reliably in my own code (unpublished) by starting a handful of nodes (~5-10) and putting them through a short series of brief zk connection losses.
I think the root problem is that the code sometimes makes several separate `zk.get()` calls in sequence while assuming that all of them interact with the same zk session, which is unsound because each `zk.get()` call can potentially start a new zk session.
Here's one simple example of where this problem can occur:

- `Cluster.joinCluster` calls `zk.get().getSessionId` to figure out what sessionId to put in its ephemeral `/nodes/<nodeId>`
- `ZKUtils.createEphemeral` calls `zk.get().create` to create the ephemeral node with the requested data

In the case where the two separate `zk.get()` calls return `ZooKeeper` instances for different sessions, we end up with a `/nodes/<nodeId>` actually created by session B (visible by `stat`) that thinks it was created by session A (visible by `get`). I think this particular race wouldn't cause the failure outlined above, but there's a `zk.get()` race at the core of that one too.
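A sketch of that unsound pattern, with the two `zk.get()` calls inlined into one method (illustration only, not the actual `Cluster`/`ZKUtils` source; the path and data encoding are assumptions):

```scala
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

object UnsoundPatternSketch {
  // Two zk.get() calls assumed to hit the same session, where the second may
  // silently be a brand-new session.
  def joinClusterSketch(zkGet: () => ZooKeeper, path: String): Unit = {
    // First zk.get(): returns the ZooKeeper for session A and its sessionId.
    val sessionId = zkGet().getSessionId

    // Second zk.get(): if session A expired in between, this returns a
    // different ZooKeeper bound to session B, so the ephemeral node is owned
    // by session B (visible via stat) but records session A in its data
    // (visible via get).
    zkGet().create(
      path,
      sessionId.toString.getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE,
      CreateMode.EPHEMERAL)
  }
}
```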
As for a fix, I'm imagining something along these lines:

- Indirect all existing calls to `zk.get()` through a call to `currentZk()`. `currentZk()` calls `zk.get()` except when there's a fixed `ZooKeeper` instance to tie us to a single session, which we can unobtrusively communicate through a dynamic variable `fixedZk`:

  ```scala
  def currentZk(): ZooKeeper = fixedZk.value getOrElse zk.get()
  val fixedZk = new DynamicVariable[Option[ZooKeeper]](None) // No fixed zk session by default
  ```

- To fix a zk session for a particular block of code, wrap the block with `zkSingleSession { ... }`, which we can define as:

  ```scala
  def zkSingleSession[X](x: => X): X = fixedZk.withValue(Some(zk.get())) { x }
  ```
Then any `zkSingleSession { body }` is guaranteed to observe only one `ZooKeeper` instance, and thus only one session. If it ever tries to use a `ZooKeeper` whose session has expired, it will trigger a `SessionExpiredException` that `body` probably can't meaningfully handle, in which case `body` will be aborted, and the caller of `zkSingleSession` will be responsible for deciding whether to fail or retry.
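For instance, wiring `joinCluster` through these helpers could look roughly like this (a sketch under the same assumptions as above; the class, the `zkGet` constructor parameter, and the node-data format are placeholders, not the real code):

```scala
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}
import scala.util.DynamicVariable

// Sketch only: `zkGet` stands in for the existing zk.get().
class SingleSessionSketch(zkGet: () => ZooKeeper) {
  val fixedZk = new DynamicVariable[Option[ZooKeeper]](None) // No fixed zk session by default

  def currentZk(): ZooKeeper = fixedZk.value getOrElse zkGet()

  def zkSingleSession[X](x: => X): X = fixedZk.withValue(Some(zkGet()))(x)

  // Both the getSessionId and the create now observe the same ZooKeeper
  // instance; if that session expires mid-block, the resulting
  // SessionExpiredException aborts the block and the caller decides whether
  // to fail or retry.
  def joinCluster(path: String): Unit = zkSingleSession {
    val sessionId = currentZk().getSessionId
    currentZk().create(
      path,
      sessionId.toString.getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE,
      CreateMode.EPHEMERAL)
  }
}
```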
This bug seems to amplify a ginormous design flaw in Twitter's ZooKeeper API, namely a method that does two things (retrieve, or create and retrieve) instead of one. I proposed a fix some time ago, but job commitments got in the way. Your bug report shows this question should be addressed as soon as possible! Maybe we can start at ZKClient and move up the stack?
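For example, the split could look something like this (purely a hypothetical interface to illustrate the separation, not the actual ZKClient API):

```scala
import org.apache.zookeeper.ZooKeeper

// Hypothetical separation of the two concerns that the current method mixes.
trait ZkSessionProvider {
  /** The currently connected ZooKeeper, if any; never connects as a side effect. */
  def current(): Option[ZooKeeper]

  /** Explicitly establishes (or re-establishes) a session; callers opt in. */
  def connect(): ZooKeeper
}
```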