Try harder to failover on recover from master loss
Opened this issue · 1 comments
applike-ss commented
Regarding this:
and this:
We just observed this behavior, and in the logs I discovered this error: `error running SLAVE OF command: dial tcp 10.138.59.180:9999: i/o timeout`, so I'll assume one of the following happened:
- network issue
- dragonfly main/networking thread blocked
- dragonfly crashed without killing the process
Due to this i would like to suggest the following changes:
- Check via a redis client that the operator can talk to the new master before promoting it
- Check via a redis client that the operator can talk to the (now) replicas before setting them as replicas of the new master
- Kill the pod if the operator can't talk to it after X tries (configurable? 0 meaning: do not kill it?)
Pothulapati commented
Thanks @applike-ss for the issue!
All the suggestions seem valid, and are easy enough to implement.