python-zk/kazoo

command not reusing connections?

fcrespofastly opened this issue · 8 comments

Hey folks!

Thanks for such a great library! This is more a quesiton than an issue, just trying to understand how the library works when it comes to creating connections and such. Here we go:

I'm hitting an issue when connecting to Zokeeper behind an AWS NLB, 3 different AZs and DNS resolves to 3 different IPs, one per each AZ.

The logic is pretty simple and it's executed as part of some acceptance tests where we use zookeeper as the system under test. Zookeeper is deployed as a kubernetes statefulset, where each replica is deployed in a different AWS AZ.

So this is basically the code in question:

@given('zookeeper is exposed behind an ACL-protected LoadBalancer')
def step_impl(context):
    zk_host = context.hostnames['zookeeper']
    zk_retry = zkretry.KazooRetry(max_tries=20, backoff=2, delay=5.0, max_delay=300.0)
    context.zk_client = zkclient.KazooClient(hosts = f'{zk_host}:2181', connection_retry=zk_retry, command_retry=zk_retry)

    def cleanup_zk(context):
        if context.zk_client:
            if context.zk_client.connected:
                context.zk_client.stop()
        context.zk_client.close()
    context.add_cleanup(cleanup_zk, context)

@when('I execute zookeeper command "{cmd}"')
def step_impl(context, cmd):
    context.zk_client.start(timeout=300)
    context.zookeeper_cmd_resp = context.zk_client.command(bytes(cmd, 'UTF-8'))
    context.zk_client.stop()

@then('I get "{response}" back from zookeeper')
def step_impl(context, response):
    assert context.zookeeper_cmd_resp == response

What I see is that, starts eventually succeeds, but command fails with a ConnectionRefusedError and that's where my confusion is coming from:

 context.zookeeper_cmd_resp = context.zk_client.command(bytes(cmd, 'UTF-8'))
  File "/usr/local/lib/python3.10/site-packages/kazoo/client.py", line 725, in command
    sock = self.handler.create_connection(
  File "/usr/local/lib/python3.10/site-packages/kazoo/handlers/threading.py", line 178, in create_connection
    return utils.create_tcp_connection(socket, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kazoo/handlers/utils.py", line 265, in create_tcp_connection
    sock = module.create_connection(address, timeout_at)
  File "/usr/local/lib/python3.10/socket.py", line 845, in create_connection
    raise err
  File "/usr/local/lib/python3.10/socket.py", line 833, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

Captured logging:
INFO:kazoo.client:Connecting to zookeeper.example.com(<IP_AZ_1>):2181, use_ssl: False
WARNING:kazoo.client:Connection dropped: socket connection error: Connection refused
INFO:kazoo.client:Connecting to zookeeper.example.com(<IP_AZ_2>):2181, use_ssl: False
INFO:kazoo.client:Zookeeper connection established, state: CONNECTED

So it's kind of weird, the connection might fail maybe because one of the NLB interface is not ready yet, so it retries, resolves the host to a different IP, and connection succeeds. But:

Why command() is creating a connection at all if the connection is already created when calling start() ? Shouldn't command reuse a connection from one of the handlers threads? Why command() is not retrying if in the client command_retry is set?

Thanks!!

ceache commented

Thanks for your response @ceache

Not an expert on Zookeeper here, but I think my questions still stand.

The command method is part of the kazoo.client module and it requires you to start (another instance method of kazoo.client) the connection, otherwise ConnectionLost is raised. But still both start and command create a new connection (TCP socket) to the host specified when constructing the kazoo.client.KazooClient object, so my question is why creating a new connection if calling start is required and already connects to Zookeeper?

Hi, The commands, aka 4lw, is a monitoring/debugging interface that is bolted on the same zookeeper client port, but they are not part of the client - server binary protocol. They just happen to use the same port.

I don't know what do you mean about the client port, probably lack of knowledge in zookeeper land, but as far as I know, TCP sockets are used so the first available port on the client side is what's going to be used. If I just telnet zookeeper I can send any command and I noticed that they close the socket afterwards yep.

In the mean time, you could easily work around this problem by looking at which server your client is connected to and establishing the connection yourself.

I don't think I can do that. This is part of an automation workflow, I can just replace kazoo and just use raw TCP sockets or nc commands as I was using previously.

ceache commented

I double checked the code and it seems to be doing the "right thing"

https://github.com/python-zk/kazoo/blob/master/kazoo/client.py#L760

It create a new TCP socket connection to the currently connected server and sends the command there. As mentioned in my previous message, the Zookeeper command interface is 1 command per connection, closed by the server and cannot be mixed with the regular client/server OpCodes.

I am not sure what you are experiencing, would you be able to run a network capture (Wireshark, tcpdump,) while testing and share it with us? I would particularly like to see the client connection and the "connection refused" while sending the command

Thanks!

Hey, I'll try to see if I can capture the traffic.

I'm still not sure why it's required to start the client if it's opening a new TCP socket. I'm not sure if I'm explaining myself properly here.

It's fine if kazoo/zookeeper requires opening a new socket per command, that's totally OK to me.

What I don't understand is why then in the KazooClient.command() method requires to call KazooClient.start(), see the ConnectionLoss exception being raised if the background threads that start() creates don't set _live when creating the connection. I guess it's only used as a healthcheck/wait for readiness? so command won't open a new socket until the background threads started by start() don't determine zookeeper is ready (by creating their own persistent TCP sockets)

Thanks for looking into this!

ceache commented

Thanks again @ceache and sorry to be a total pain if I'm not getting this correctly.

More than requesting a feature or a change in the code, I was trying to understand why command is behaving this way:

  1. Check if there's any connection "alive" (as the client called start() first) by checking self._live.is_set()
  2. Create a socket with the destination being the peer from that "alive" connnection.

So my question was, why creating a new socket (2) if there's already a live socket created by start() and checked by command() in step 1.

Here's the official docs around these "4 letter word" commands https://zookeeper.apache.org/doc/r3.8.1/zookeeperAdmin.html#sc_4lw

Important being the:

You issue the commands to ZooKeeper via telnet or nc, at the client port.

which from my understanding is dissimilar to the usual zookeeper protocol which is why we create a new socket.

An extra confirmation for the requirement of the new socket can be found in the Zookeeper server code. You can see that the checkFourLetterWord check is only performed when the connection is not initialized which is set after a correct connect message is sent by the client over the existing socket, (which is what we do when we "start" a Zookeeper client).

Also, with regards to

and checked by command() in step 1.

I can't seem to find any place where we call client.command(...) directly, except for the server_version method which we again don't call directly in the client. The _live event is set when a connection is completed successfully and the session callback is called (see here)

Hope this clarifies things.

I can't seem to find any place where we call client.command(...) directly, except for the server_version method which we again don't call directly in the client. The _live event is set when a connection is completed successfully and the session callback is called (see here)

It's me in my testing code the one calling client.command() and noticed that it's creating a new socket.

Understood, thanks!