Grokzen/redis-py-cluster

Command keys fail if one of node fail

datasatanic opened this issue · 6 comments

Code to reproduse

from time import sleep

from rediscluster import RedisCluster

startup_nodes = [{'host': '192.168.75.160', 'port': 7000}, {'host': '192.168.75.109', 'port': 7000},
                 {'host': '192.168.75.120', 'port': 7000}]

if __name__ == '__main__':
    redis_cluster = RedisCluster(startup_nodes=startup_nodes)
    keys = {f'test{i}': f'test{i}' for i in range(100)}
    redis_cluster.mset(**keys)
    while True:
# Call DEBUG SEGFAULT for any node
        print(len(redis_cluster.keys()))
        sleep(5)

I know that the command is sent to all nodes. But if one of nodes failed in work, all command raise ConnectionError. Is it possible to return only active node? Or force reload cluster config? How reload config if I use host,port in args, not startup-nodes (If the node specified in the arguments is broken) ?

@datasatanic I need a stack trace for this as i need to see exactly where you get a connection error from. If you get it from the iniitailization process or if you get it from your keys command.
There will be no option to return partial data because if you can't really get to a node in your cluster there is typically some issue. But i will say that the logic of which these multi node commands use is a bit different compared to all other regular commands and it might just be that when you get a connection failure on one of the nodes the client should technically ask for a cluster update in case the master node have moved over to some other IP.
What you can do to manually force the cluster to reinitialize in case you hade a master failover event that is not handled properly is to try/except the keys command and then you run the initialize() method again manually on the RedisCluster object and it should force it to reload in those cases. Look at how ConnectionErrors is handled in _execute_command()

  File "/home/ilya/Documents/Projects/SBP/referencedataloader/api/misc.py", line 72, in reload
    cluster_keys = redis_cluster.keys(f'{prefix}.*')
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/redis/client.py", line 1661, in keys
    return self.execute_command('KEYS', pattern)
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/rediscluster/client.py", line 555, in execute_command
    return self._execute_command(*args, **kwargs)
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/rediscluster/client.py", line 581, in _execute_command
    return self._execute_command_on_nodes(node, *args, **kwargs)
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/rediscluster/client.py", line 738, in _execute_command_on_nodes
    connection.send_command(*args)
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/redis/connection.py", line 725, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/redis/connection.py", line 698, in send_packed_command
    self.connect()
  File "/home/ilya/miniconda3/envs/referencedataloader/lib/python3.9/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 192.168.75.120:7000. Connection refused.

What version of this code are you using? what version of redis-py?

redis-py-cluster 2.1.2
redis 3.5.3

Found problem there
rediscluster/client.py

def _execute_command_on_nodes(self, nodes, *args, **kwargs):
    """
    """
    command = args[0]
    res = {}

    for node in nodes:
        connection = self.connection_pool.get_connection_by_node(node)

        # copy from redis-py
        try:
            connection.send_command(*args)
            res[node["name"]] = self.parse_response(connection, command, **kwargs)
        except (ConnectionError, TimeoutError) as e: # ConnectionError: Error 111 connecting to 192.168.75.109:7000. Connection refused
            connection.disconnect()

            if not connection.retry_on_timeout and isinstance(e, TimeoutError):
                raise

            connection.send_command(*args)
            res[node["name"]] = self.parse_response(connection, command, **kwargs) # FAIL There
        except ClusterDownError:
            self.connection_pool.disconnect()
            self.connection_pool.reset()
            self.refresh_table_asap = True

            raise
        finally:
            self.connection_pool.release(connection)

So that piece of code technically does what it is supposed to do. All of these multislot and multinode commands always have been worse implemented then all of the other single slot commands. And due to that i never merged both code paths into a single unified solution then here we are with these kinds of problems. I will argue that i think that sending a command that is supposed to go to all nodes and fails on one of the expected cluster nodes is kinda intended. There is no plans on rebulding this in the 2.x.x version track of this lib so you will have to wait for 3.0 or make some other solution on your own side, or submit a PR that fixes this issue