sgroschupf/zkclient

NullPointerException possibility during reconnection

Closed this issue · 4 comments

It appears that a NullPointerException can be thrown if the ZkClient is being used while a state change requiring a reconnection is being processed.

See https://issues.apache.org/jira/browse/KAFKA-824?focusedCommentId=14019284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14019284 for example stack traces and a preliminary analysis of the problem.

I'm affected as well by this issue.

I had a look at the code. I think there are only two ways such an exception can occur:

    1. a null zkConnection is passed in
    2. a retryUntilConnected action wakes up and the client was closed in the meantime

I could reproduce the NPE for case 2 and changed the code to throw a clear exception instead of risking unclear follow-up exceptions like the NPEs.
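
For readers following along, here is a minimal sketch of the kind of guard described for case 2. It is not the actual zkclient patch: `ClosedGuardSketch`, `ConnectionLostSketchException`, and the `String` field standing in for the connection handle are all hypothetical names made up for illustration. The idea is only that the retry loop checks a closed flag and fails with a descriptive exception rather than dereferencing a handle that `close()` has already nulled out.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a client with a retry loop; NOT the zkclient source.
class ClosedGuardSketch {
    // Hypothetical marker for a transient connection loss that should be retried.
    static class ConnectionLostSketchException extends RuntimeException {
        ConnectionLostSketchException(String message) { super(message); }
    }

    private volatile boolean closed = false;
    // A plain String stands in for the real connection handle.
    private volatile String connection = "session";

    void close() {
        closed = true;      // mark closed first so retrying callers can see it
        connection = null;  // after this, dereferencing the handle would throw an NPE
    }

    <T> T retryUntilConnected(Supplier<T> action) {
        while (true) {
            // Fail fast with a clear message instead of running into an NPE later.
            if (closed) {
                throw new IllegalStateException("client already closed");
            }
            try {
                return action.get();
            } catch (ConnectionLostSketchException e) {
                Thread.yield(); // transient loss: loop and try again
            }
        }
    }

    int connectionLength() {
        return retryUntilConnected(() -> connection.length());
    }
}
```

With this shape, calling `connectionLength()` after `close()` raises `IllegalStateException` rather than a `NullPointerException`. Note that a check at the top of the loop narrows the window but does not remove it: `close()` can still run between the check and the action.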

Thank you for fixing this!

@jzillmann One user is still seeing similar NPEs in ZkConnection.java after upgrading to 0.5. The patch appears to address the race condition around ZkClient.close(), but isn't there still a case where reconnect could cause the ZooKeeper reference in ZkConnection to be set to null just before a retry in retryUntilConnected? Unless I'm misunderstanding the code (which is definitely possible), it looks like reconnect is called from the ZooKeeper event thread. So between the time ZkConnection.close() is called and the time the connection is reestablished in ZkConnection.connect(), what prevents the user from seeing a null ZooKeeper in ZkConnection? Maybe I'm missing something?
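
To make the window I'm asking about concrete, here is a small sketch. It is purely illustrative and assumes nothing about how ZkConnection is really implemented: `ReconnectWindowSketch`, `HandleUnavailableException`, and the `String` handle are hypothetical. It models a reconnect as "null the handle, then assign a new one" and shows one possible way a reader could avoid an NPE in between, by snapshotting the reference and raising a descriptive exception.

```java
// Hypothetical model of a close-then-connect reconnect; NOT the ZkConnection source.
class ReconnectWindowSketch {
    // Hypothetical exception a retry loop could treat as "try again later".
    static class HandleUnavailableException extends RuntimeException {
        HandleUnavailableException(String message) { super(message); }
    }

    // A plain String stands in for the real handle.
    private volatile String handle = "session-1";

    void closeHandle() { handle = null; }

    void connectHandle(String newHandle) { handle = newHandle; }

    // Reconnect as two separate steps; another thread may read in between.
    void reconnect(String newHandle) {
        closeHandle();
        // window: handle is null here until connectHandle() runs
        connectHandle(newHandle);
    }

    int handleLength() {
        String local = handle; // snapshot once so the check and the use see the same value
        if (local == null) {
            throw new HandleUnavailableException("handle is null; reconnect may be in progress");
        }
        return local.length();
    }
}
```

If a caller lands between `closeHandle()` and `connectHandle()`, it gets `HandleUnavailableException` instead of an NPE. Whether zkclient should retry, block, or fail in that situation is exactly the open question.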