Unreachable node gets selected every time, causing TTL exhausted
tahamr83 opened this issue · 2 comments
Summary
We have a Redis cluster that runs on Kubernetes. After one of our nodes crashed, the StatefulSet created a new pod, leaving the old pod IP of that Redis node unroutable. The redis-py-cluster client was for some reason unable to obtain the new state of the cluster, and we see a lot of TTL exhausted errors.
Even if the dead, unroutable Redis node is selected, shouldn't the client refresh the cluster state and remove the dead node from its node list? Instead, all 16 TTL attempts select the exact same node, and we finally see a TTL exhausted error.
[2021-10-01 13:35:36 DEBUG rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - TTL loop : 15
[2021-10-01 13:35:36 DEBUG rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - Determined node to execute : {'host': '10.244.7.123', 'port': 6379, 'name': '10.244.7.123:6379', 'server_type': 'master'}
[2021-10-01 13:35:39 ERROR rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - ConnectionError
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
sock = self._connect()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
raise err
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
sock.connect(socket_address)
OSError: [Errno 113] No route to host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/rediscluster/client.py", line 630, in _execute_command
connection.send_command(*args)
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 726, in send_command
check_health=kwargs.get('check_health', True))
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 698, in send_packed_command
self.connect()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 113 connecting to 10.244.7.123:6379. No route to host.
[2021-10-01 13:35:36 DEBUG rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - TTL loop : 3
[2021-10-01 13:35:36 DEBUG rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - Determined node to execute : {'host': '10.244.7.123', 'port': 6379, 'name': '10.244.7.123:6379', 'server_type': 'master'}
[2021-10-01 13:35:39 ERROR rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - ConnectionError
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
sock = self._connect()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
raise err
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
sock.connect(socket_address)
OSError: [Errno 113] No route to host
Why does the client select the dead node 10.244.7.123:6379 every time, instead of trying another live node and then fetching the new cluster state?
I am guessing here, but the issue is probably that your Redis cluster is not booting out the dead node. This client is coded so that it only uses what the Redis server reports as the current cluster state; if the node is not booted out of the cluster, it will remain in the client's node table even if it is unreachable. That is just the reference logic a client should implement.
You can probably verify this by checking what CLUSTER INFO, CLUSTER SLOTS, and CLUSTER NODES return when a node drops, and by monitoring the master nodes to see how long it takes for the cluster to reach a new consensus that eventually propagates to all clients.
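For example (a hedged sketch, assuming plain redis-py is available; the pod IPs below are made up), you could poll each reachable node directly and compare what they report:

```python
# Hedged sketch: inspect the cluster state node by node with plain redis-py.
# The host list is an assumption -- substitute your own pod IPs.
import redis

hosts = ["10.244.7.120", "10.244.7.121", "10.244.7.122"]  # hypothetical masters

for host in hosts:
    r = redis.Redis(host=host, port=6379, socket_timeout=2, decode_responses=True)
    try:
        info = r.execute_command("CLUSTER INFO")
        nodes = r.execute_command("CLUSTER NODES")
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError) as exc:
        print(f"{host}: unreachable ({exc})")
        continue
    print(f"--- {host} ---")
    print(info)   # check cluster_state and cluster_known_nodes
    print(nodes)  # the dead node should eventually be flagged as fail
```

The equivalent check from a shell would be `redis-cli -h <pod-ip> -p 6379 cluster nodes` against each reachable pod.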
Also note that if you check the client code in the ConnectionError section here https://github.com/Grokzen/redis-py-cluster/blob/master/rediscluster/client.py#L647, you will see further down, at https://github.com/Grokzen/redis-py-cluster/blob/master/rediscluster/client.py#L660, that the code should attempt a full node table refresh after 5 connection errors and do a full reinitialize of the cluster state. That circles back to the initial point: your cluster has not reached a new consensus, and that is the root cause of your issue.
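To make that concrete, here is a minimal, self-contained sketch of the retry pattern being described (the names and structure are illustrative, not the library's actual API): the command is retried up to the TTL, and after five consecutive connection errors the node table is refreshed, but the refreshed table is still whatever the cluster itself reports, so an un-failed-over dead node keeps getting picked.

```python
# Illustrative sketch of the TTL retry pattern, not the library's real code.
class ClusterError(Exception):
    pass

def execute_with_ttl(get_node_for_slot, send_command, refresh_nodes, ttl=16):
    """Retry a command up to `ttl` times, refreshing the node table after
    five consecutive connection errors."""
    connection_errors = 0
    while ttl > 0:
        ttl -= 1
        node = get_node_for_slot()          # always the master the cluster maps the slot to
        try:
            return send_command(node)
        except ConnectionError:
            connection_errors += 1
            if connection_errors >= 5:
                # Full node table refresh -- but the new table still comes from
                # the cluster itself, so a node that has not been failed over
                # stays in the table and gets selected again.
                refresh_nodes()
    raise ClusterError("TTL exhausted.")
```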
Thank you so much for your analysis. Apologies for opening an unnecessary issue.
This seems to be a cluster problem rather than a client issue.