Operator takes no decision when connection is lost

Question

Operator takes no decision when connection is lost

Closed this issue 4 months ago · 5 comments

Hi,

On 3 August, the GCP europe-west9-a zone went down. Our master was in this zone and the other replicas were in europe-west9-b and europe-west9-c. So we still had 3 pods, enough for a "quorum", but dragonfly-operator made no decision about switching the master.

Timeline:

13:29 GCP europe-west9-a goes down (Node status: Unreachable), taking the master pod with it
13:29 Dragonfly-operator loops on this message: Master pod is not ready yet, will requeue
...
13:36 Manual action: delete the master pod

I find it hard to believe that the operator takes no decision in this situation. The connection had been lost for several minutes, so shouldn't the operator promote another healthy replica to master?

Answer 1 · 2024-08-08T07:04:28.000Z

Hi @SoGooDFR, sorry for the incident. This should not happen as we patched a fix for this in v1.1.3. What is the version you are using?

Answer 2 · 2024-08-08T07:26:51.000Z

From the log message you shared, it seems you are using >=v1.1.3. Currently, we only fail over when the master pod tries to restart. In your case the node itself went down, so the failover unfortunately wasn't triggered. We need to strengthen our health check and failover logic so that this never happens again. I will fix it asap. Again, sorry for the incident.
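
To make the gap concrete, here is a minimal sketch of the kind of decision a stronger health check could make. It is not the operator's actual code: `PodState`, `pick_new_master`, the 30-second grace period, and the pod names are all hypothetical, chosen only to illustrate treating an unreachable node the same way as a failed master pod.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodState:
    """Hypothetical snapshot of one Dragonfly pod as seen by a controller."""
    name: str
    role: str                   # "master" or "replica"
    ready: bool                 # pod readiness
    node_reachable: bool        # False when the node status is Unreachable
    down_seconds: float = 0.0   # how long the pod has been unhealthy


# Assumed grace period before failing over; not a real operator setting.
GRACE_SECONDS = 30.0


def pick_new_master(pods: List[PodState],
                    grace_seconds: float = GRACE_SECONDS) -> Optional[str]:
    """Return the replica to promote, or None if no failover is needed."""
    master = next((p for p in pods if p.role == "master"), None)
    # Only replicas that are ready and on a reachable node can be promoted.
    candidates = [p for p in pods
                  if p.role == "replica" and p.ready and p.node_reachable]

    if master is not None:
        # A master counts as lost if its pod is not ready OR its node is
        # unreachable, and that state has lasted past the grace period.
        unhealthy = (not master.ready) or (not master.node_reachable)
        if not (unhealthy and master.down_seconds >= grace_seconds):
            return None

    return candidates[0].name if candidates else None


# The incident from this issue: the master's node was unreachable for about
# 7 minutes (13:29 to 13:36) while two replicas stayed healthy.
incident = [
    PodState("dragonfly-0", "master", ready=False, node_reachable=False,
             down_seconds=420.0),
    PodState("dragonfly-1", "replica", ready=True, node_reachable=True),
    PodState("dragonfly-2", "replica", ready=True, node_reachable=True),
]
print(pick_new_master(incident))
```

Under these assumptions the sketch would promote the first healthy replica instead of requeueing indefinitely. The grace period is there so that a brief network blip does not trigger an unnecessary failover.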

Answer 3 · 2024-08-18T12:14:56.000Z

This seems to be the same issue as #227?

Answer 4 · 2024-08-21T11:02:28.000Z

Hey @Abhra303, any update on this?

We also ran into this while evaluating Dragonfly and are now blocked from using it.

If you believe there will not be a fix soon, we will look for something else.

Thanks

Answer 5 · 2024-08-21T13:02:37.000Z

Hi @orenhecht, we will ship a patch release next week.