dragonflydb/dragonfly-operator

Try harder to failover on recover from master loss


Regarding this:

// TODO: Why does this fail every now and then?

and this:

// Should replication be continued if it fails?

We just observed this behavior, and in the logs I found this error: error running SLAVE OF command: dial tcp 10.138.59.180:9999: i/o timeout. I assume one of the following happened:

  • network issue
  • dragonfly main/networking thread blocked
  • dragonfly crashed without killing the process

Because of this, I would like to suggest the following changes:

  • Check via a redis client that the operator can talk to the new master before promoting it
  • Check via a redis client that the operator can talk to each of the (now) replicas before setting it as a replica of the new master
  • Kill the pod if the operator still can't talk to it after X tries (configurable, with 0 meaning "do not kill it"?)

Thanks @applike-ss for the issue!

All the suggestions seem valid, and are easy enough to implement.