mediocregopher/radix

If the sentinel Redis master node is killed but not removed from the sentinels' list, Redis connections break


Problem

If the Redis Sentinel master node goes offline in such a way that it is not properly removed, the following happens:

  1. a failover to a new master happens, which is correct
  2. the previous Redis master is demoted to a slave

All of this happens on the Redis side, and everything is fine there.

The problem comes from the fact that in this case the broken Redis slave is not removed from the sentinels' list. Instead it actually gets the flags `s_down,slave`.

The library ignores this flag, so it fails to switch over to the new master: it keeps trying to connect to the broken slave and errors out.
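You can see this state directly by asking one of the sentinels for its slave list. Below is a minimal sketch using radix's v3-style API (the sentinel address `127.0.0.1:26379` and master name `mymaster` are placeholders), assuming the `SENTINEL SLAVES` reply decodes into one field/value map per slave:

```go
package main

import (
	"fmt"
	"log"

	"github.com/mediocregopher/radix/v3"
)

func main() {
	// Connect directly to one sentinel (placeholder address).
	conn, err := radix.Dial("tcp", "127.0.0.1:26379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// SENTINEL SLAVES returns field/value pairs for each known slave.
	var slaves []map[string]string
	if err := conn.Do(radix.Cmd(&slaves, "SENTINEL", "SLAVES", "mymaster")); err != nil {
		log.Fatal(err)
	}

	for _, s := range slaves {
		// A dead-but-never-removed slave keeps showing up here,
		// with its "flags" field set to "s_down,slave".
		fmt.Printf("%s:%s flags=%s\n", s["ip"], s["port"], s["flags"])
	}
}
```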

At least this is what happened in our production Kubernetes cluster when one Kubernetes node/machine died completely. The library started getting `no route to host` errors, which makes total sense. The only way to recover was to manually remove the slave from the sentinels' list, on each sentinel separately, using the `SENTINEL RESET` command.
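For reference, that manual recovery can also be scripted; here is a hedged sketch using radix's v3-style API, with placeholder sentinel addresses and master name. `SENTINEL RESET` only affects the instance it is sent to, which is why every sentinel has to be hit:

```go
package main

import (
	"log"

	"github.com/mediocregopher/radix/v3"
)

func main() {
	// Must be run against every sentinel in the deployment, since
	// SENTINEL RESET only clears state on the instance receiving it.
	for _, addr := range []string{"10.0.0.1:26379", "10.0.0.2:26379", "10.0.0.3:26379"} {
		conn, err := radix.Dial("tcp", addr)
		if err != nil {
			log.Fatal(err)
		}
		// SENTINEL RESET <pattern> makes the sentinel forget the slaves
		// (and other sentinels) of matching masters; healthy nodes are
		// re-discovered, the dead one is not.
		if err := conn.Do(radix.Cmd(nil, "SENTINEL", "RESET", "mymaster")); err != nil {
			log.Fatal(err)
		}
		conn.Close()
	}
}
```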

In that case the service also can't be restarted, as new instances will not come up for the same reason, even though one healthy master and slave existed and the app could otherwise have functioned normally.

The error happens here:

```go
// addr comes from the sentinel's node list, which in this state still
// includes the dead slave flagged s_down, so dialing it fails
client, err := sc.client(ctx, addr)
```

A possible fix is to exclude all slaves that have the `s_down` status. As far as we tested, this fixed the issue.
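The actual change is in the PR below; for illustration, here is a minimal, library-agnostic sketch of the idea, assuming the `SENTINEL SLAVES` reply has already been decoded into one field/value map per slave (`skipDownSlaves` and the sample data are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// skipDownSlaves drops every slave whose "flags" field contains
// "s_down" (subjectively down), so the connection logic never tries
// to dial an address the sentinel already knows is unreachable.
func skipDownSlaves(slaves []map[string]string) []map[string]string {
	healthy := make([]map[string]string, 0, len(slaves))
	for _, s := range slaves {
		// "flags" is a comma-separated list, e.g. "s_down,slave".
		if strings.Contains(s["flags"], "s_down") {
			continue
		}
		healthy = append(healthy, s)
	}
	return healthy
}

func main() {
	// Hypothetical sample data mirroring the broken state described above.
	slaves := []map[string]string{
		{"ip": "10.0.0.5", "port": "6379", "flags": "slave"},
		{"ip": "10.0.0.9", "port": "6379", "flags": "s_down,slave"},
	}
	for _, s := range skipDownSlaves(slaves) {
		fmt.Printf("usable slave %s:%s\n", s["ip"], s["port"])
	}
}
```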
Proposed fix:
#343

Thank you @mrkmrtns, I've merged the PR and will tag a new release shortly :)