python-zk/kazoo

self._attempts of KazooRetry will nerver grow up more than 1 when ConnectionDropped exception occur repeatedly in _connect_attempt

thenumberouscode opened this issue · 4 comments

Expected Behavior

when I set max_tries = 3 , I found that the retry still can be infinity. Does I misunderstand the meaning of max_tries?
I think self._attempts should be grow up when retry actually occur.

Actual Behavior

now self._attempts is always change between 0 and 1 when ConnectionDropped exception occur repeatedly in _connect_attempt.

Snippet to Reproduce the Problem

COPY/PASTE the snippet here (omit any sensitive information)

Logs with logging in DEBUG mode

COPY/PASTE the result of the snippet here (omit any sensitive information)

Specifications

  • Kazoo version: 2.9.0
  • Result of pip list command:
  • Zookeeper version:
  • Zookeeper configuration: put here any useful ZK configuration (authentication, encryption, number of ZK members, number of (concurrent?) clients, Java version, krb5 version, etc.)
  • Python version:
  • OS:

I handwrite a ConnectionDroped exception in connection.py line 587
Snipaste_2022-12-23_13-31-39

then I set max_tries=3, the log indicated that client retried many times.
KazooClient(self.hosts, auth_data=[("digest", self.auth)], connection_retry={"max_tries":"3", "delay": 0.1, "backoff": 2, "max_delay": 3600}, logger=logger)
Snipaste_2022-12-23_20-34-24

Have you noticed this happening in the wild?

If the server was in a bad state, self._connect would fail which is only a few lines above.

Looking at the code, this would only cause this pathological case only if the server would always respond successfully to a connect but fail on a ping.

I guess the assumption would be that after a connection drop in the ping loop, we want to immediately reconnect after which either the problem got solved or the _connect call would throw which happens before the retry.reset() call that results in the status you notice.

IMO I think this assumption holds for the lifetime of a client, especially considering that failure in a _connect_attempt would result in the next connection being attempted on the next host in the list, which hopefully has a high chance of succeeding.

Yes, I think your understanding of the code logic is reasonable, the issue can be closed.