sgroschupf/zkclient

Cannot close zkclient from a listener

Opened this issue · 9 comments

On a recent project, we wanted to close and refresh zkclient when the connection to the server was disrupted. We were unable to do so because of the mechanics of calling .close() from within the event thread context: close() joins the event thread, so invoking it from a listener callback blocks on the very thread it is running in. I found an issue on the Apache Helix JIRA that goes into more detail, which informed a workaround: run the close/refresh code in a new thread started by IZkStateListener.handleStateChanged, rather than running it directly within that method. The JIRA ticket is here: https://issues.apache.org/jira/browse/HELIX-264
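For concreteness, here is a minimal sketch of that workaround. The refresh logic is a placeholder, and depending on the zkclient version IZkStateListener may declare additional methods beyond the two shown:

    import org.I0Itec.zkclient.IZkStateListener;
    import org.I0Itec.zkclient.ZkClient;
    import org.apache.zookeeper.Watcher.Event.KeeperState;

    public class RefreshOnDisconnect {
        private volatile ZkClient zkClient; // the current client, swapped on refresh

        void install() {
            zkClient.subscribeStateChanges(new IZkStateListener() {
                @Override
                public void handleStateChanged(KeeperState state) throws Exception {
                    if (state == KeeperState.Disconnected) {
                        // close() joins the event thread, which is the thread this
                        // callback runs on, so hand the close/refresh off to a
                        // fresh thread instead of calling it inline.
                        new Thread(new Runnable() {
                            public void run() {
                                zkClient.close();
                                // ... re-resolve servers and create a fresh
                                // ZkClient here ...
                            }
                        }).start();
                    }
                }

                @Override
                public void handleNewSession() throws Exception {
                    // nothing to do in this sketch
                }
            });
        }
    }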

Is this desirable behavior? If not, how should the code change to accommodate closing zkclient from within the IZkStateListener?

The sledgehammer approach might be to run each handler in its own thread, effectively wrapping the following line in a thread and calling start on it: https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L71
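Roughly, inside ZkEventThread.run() that would mean something like the following (field names as in the linked source; this changes event ordering and complicates the existing exception handling, so it is only a sketch):

    // Inside ZkEventThread.run(): take the next event as before ...
    // (InterruptedException is handled by the surrounding loop in the real code)
    final ZkEvent zkEvent = _events.take();
    // ... but dispatch it on its own thread, so a handler that calls
    // ZkClient.close() no longer blocks the event thread itself.
    new Thread(new Runnable() {
        public void run() {
            try {
                zkEvent.run();
            } catch (Throwable e) {
                LOG.error("Error handling event " + zkEvent, e);
            }
        }
    }).start();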

Another approach would be special handling within the close method itself, for example executing its body on a separate thread so that it is never run from within the event thread context.

I believe either approach would complicate exception handling, so any suggestions or advice from more familiar eyes would be awesome.

Hey @freethejazz, why do you want to do that? ZkClient reconnects itself when it gets an expired event. Do you want that for other events as well?

We're currently using Zookeeper from an autoscaling group within AWS. As a result, the IPs of the Zookeeper nodes may change.

The current lifecycle is along these lines:

1. The application starts and gets IP information from the load-balanced Zookeeper nodes.
2. If we experience an outage and AWS bounces the ZK boxes, they all come back with new IPs.
3. The ZK client tries to reconnect to IPs that are no longer valid.
4. Instead, we want to fetch the list of IPs from Exhibitor again (sketched below).
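For that refresh step, a hedged sketch of what we'd want: query Exhibitor's cluster-list REST endpoint and rebuild the connect string from whatever servers it reports. The endpoint path below is Exhibitor's standard one, but verify it against your deployment; JSON parsing is elided.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ExhibitorServers {
        // Returns the raw cluster description; real code would parse the JSON
        // and join the servers as "ip:port,ip:port" for the ZK connect string.
        static String fetchClusterList(String exhibitorHostPort) throws Exception {
            URL url = new URL("http://" + exhibitorHostPort + "/exhibitor/v1/cluster/list");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            try {
                // Typical response shape: {"servers":["10.0.1.5","10.0.2.7"],"port":2181}
                return in.readLine();
            } finally {
                in.close();
            }
        }
    }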

Right now, our best way of determining this kind of failure is by using IZkStateListener.

As for your second question, we don't have any current need for this sort of thing in the Child or Data listeners.

Does that help clarify the issue?

So, if I understand everything correctly: your IZkStateListener can somehow identify that the IPs are no longer valid, and then wants to close the current zkClient and re-open it for another zookeeper server!?

Instead of

"re-open it for another zookeeper server"

it would be

"re-open it for another set of zookeeper server host/port combinations"

I think you have the right idea, though. Based on what I've done as a workaround, there may be deeper issues to resolve for our use case, but this would be a start.

Ok, so could you have a look into the ZkClient#close() method and wrap the _eventThread.join() call in this if block:

if (_eventThread != Thread.currentThread()) {
    _eventThread.join(2000);
}
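In context, the patched close() would sit roughly like this (a simplified sketch; the real method also acquires the event lock around this block, and details vary by version):

    public void close() throws ZkInterruptedException {
        if (_closed) {
            return;
        }
        try {
            setShutdownTrigger(true);
            _eventThread.interrupt();
            // Skip the join when close() is invoked from the event thread
            // itself; a thread cannot meaningfully wait for itself to die.
            if (_eventThread != Thread.currentThread()) {
                _eventThread.join(2000);
            }
            _connection.close();
            _closed = true;
        } catch (InterruptedException e) {
            throw new ZkInterruptedException(e);
        }
    }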

Can you test whether that helps with your problem?

@jaikiran @freethejazz

Does this solution solve the problem?

if (_eventThread != Thread.currentThread()) {
    _eventThread.join(2000);
}

@wlchou Apologies, but I never tried it. I originally filed this issue after we had abandoned this approach for our use case, and I somehow missed the notification with @jzillmann's fix and request to test. I'm not sure if this helps at all, but the initial reason for filing this issue was that dynamic cluster management was not possible in ZooKeeper. At the time, the feature had been on the radar for around 6 years and had made it into 3.5.0-alpha, but the release cycle was so slow we couldn't wait for it to be available. Nine months later, 3.5.0 is still in alpha. Here's the issue we watched for the feature we needed: https://issues.apache.org/jira/browse/ZOOKEEPER-107, and here is the documentation for the new feature that's been on the shelf since late 2014: https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html

I know it isn't an appropriate solution in many cases, but I would strongly suggest looking elsewhere to solve your distributed consensus needs. etcd and consul have their problems as well, but they are at least active projects that seem to have a future. Also, please note that the suggestion to migrate off of ZooKeeper has nothing to do with the sgroschupf/zkclient project, but with the ZooKeeper project itself.

Sorry I couldn't be of more help,
Jonathan

@wlchou Of course, moments after posting this I remembered what we did. Our ZK cluster was in AWS, so we put an Elastic Load Balancer in front of it and changed the client connection string to point to the load balancer. If the connection to the initially resolved ZK instance failed, the client would try to connect again to the hosts defined in the connection string, which was just the hostname of the ELB. The ELB would then deliver an IP that was healthy and running, and the client would reconnect. See the FAQ here: http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A8
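Concretely, the client setup looked something like this (hostname and timeout values are illustrative):

    import org.I0Itec.zkclient.ZkClient;

    // Point the connection string at the ELB's stable DNS name instead of
    // individual node IPs; when the connection drops, the client reconnects
    // through the ELB, which hands out a currently healthy node.
    ZkClient zkClient = new ZkClient("zk-elb.internal.example.com:2181", 30000, 10000);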

@wlchou @jzillmann I encountered the same issue. How can this problem be resolved?