sewenew/redis-plus-plus

[BUG] socket_timeout parameter has no effect if the link is broken between redis instance and the client. Outage towards a slot range for several minutes.

Opened this issue · 0 comments

Describe the bug
socket_timeout parameter has no effect if the link is broken between redis instance and the client. Outage towards a slot range for several minutes.

To Reproduce

  • Async client is used with cluster mode. Redis cluster is used with 3 masters and 3 slaves.
  • generate continuous traffic
  • during the traffic, select a master redis instance, apply the following iptables rule in the container/vm of that server:
    iptables -A OUTPUT -p tcp --sport redis_port -s redis_ip -j DROP
  • kill that redis server (kill -9 redis_server_pid) [-> new master election will happen for that slot range]

Expected behavior
After socket_timeout, redis-plus-plus discover the new elected master / broken connection, traffic is redirected to that master

Unexpected result: old connection is used, continuous TCP packet retransmissions, no response to users for the given slot range for several minutes

Environment:
OS: Rocky Linux 8.2-20.el8.0.1
Compiler: gcc version 8.5.0
hiredis version: hiredis 1.2.0
redis-plus-plus version: 1.3.12

Additional context
Correction proposal:

  • introduce a new property per connection: last_response_received
  • failure detection: whenever a request is sent in a connection, check if (t_now - last_response_received) is under socket_timeout + TOLERATION (some milliseconds).
  • when the failure is detected, be careful with CLUSTER SLOTS, do not target the problematic master instance, select another (I see issues like that)
  • check if mastership was changed, and reconnect to the new master if needed. Abort the ongoing requests towards the redis-plus-plus user, redis-plus-plus user should retransmit the request, maybe with some delay.

As a workaround, we started guard timer and reset the AsyncRedisClient at timeouts to force redis-plus-plus to discover the mastership changes / broken links. But this solution also caused some issues, see:
#577
#578