Operator takes no decision when connection is lost
Closed this issue · 5 comments
Hi,
On 3 August, the GCP europe-west9-a zone went down. Our master was in this zone and the other replicas were in europe-west9-b and europe-west9-c. So we had 3 pods, enough for a quorum, but dragonfly-operator took no decision about switching the master.
Timeline:
- 13:29 GCP europe-west9-a goes down (node status: Unreachable), taking the master pod with it
- 13:29 dragonfly-operator loops on this message: Master pod is not ready yet, will requeue
- ...
- 13:36 Manual action: delete the master pod
I find it hard to believe that the operator takes no decision in this situation. The connection had been lost for several minutes, so shouldn't another healthy replica be promoted to master?
Hi @SoGooDFR, sorry for the incident. This should not happen as we patched a fix for this in v1.1.3. What is the version you are using?
From the log message you shared, it seems you are using >=v1.1.3. Currently, we trigger a failover if the master tries to restart. In your case the node itself went down, so the failover unfortunately wasn't triggered. We need to strengthen our health check and failover logic so this never happens again. I will fix it asap. Again, sorry for the incident.
Hey @Abhra303, any update on this?
We also ran into this while evaluating Dragonfly and are now blocked from using it.
If you believe there will not be a fix soon, we will look for something else.
Thanks
Hi @orenhecht, We will patch a release next week.