IBM/operator-for-redis-cluster

Full cluster shutdown of all k8s nodes leaves redis-cluster pods in Error state

TANISH-18 opened this issue · 10 comments

Steps to reproduce the issue:
Redis operator and redis-cluster are up and running.
Shut off all k8s instances from OpenStack and start them again after 5 mins.
The redis-cluster pods are stuck in an error state.
Once all k8s nodes come back, the Redis operator is up and running, but it is able to bring up only 1 redis-cluster pod.

There are two issues once all the k8s cluster nodes came back up.

  1. The Redis operator is not able to delete the pods stuck in the error state.
  2. The Redis operator is only able to bring up 1 cluster pod. From the logs we can see that after reconciling one cluster pod it finishes the reconcile and also sets the CRO state to OK.

kc get pods -n fed-redis-cluster -o wide

rediscluster-node-for-redis-9fhdz 0/2 Error 2 91m
rediscluster-node-for-redis-hj59b 0/2 Error 2 93m
rediscluster-node-for-redis-j69st 0/2 Error 0 37m
rediscluster-node-for-redis-w2m79 2/2 Running 0 36m

This is an issue on the Redis operator side. It was seen while performing a resiliency run for the power-outage case.

@cin can you please take a look?

cin commented

Thanks for the issue. This looks like a bug @TANISH-18. I don't currently have an OpenStack cluster to test against. Maybe there's a way to reproduce the issue w/out OpenStack. So you basically shut down all the worker nodes and then brought them back up to get into this state? Do you see anything in the operator logs related to the issue? Also, what is the actual error on the pods? Thanks in advance!

So you basically shutdown all the worker nodes and then brought them back up to get into this state?

  • Yes, correct.
    Do you see anything in the operator logs related to the issue?

  • Yes, one thing I noticed is that it keeps trying to reconnect to the pods that are stuck in the error state.
    I1115 15:42:30.128018 7 connections.go:120] Cannot connect to 192.168.190.20:6379
    I1115 15:42:30.128050 7 connections.go:196] Can't connect to 192.168.190.20:6379: dial tcp 192.168.190.20:6379: i/o timeout
    I1115 15:42:30.131949 7 clusterinfo.go:163] Temporary inconsistency between nodes is possible. If the following inconsistency message persists for more than 20 mins, any cluster operation (scale, rolling update) should be avoided before the message is gone
    I1115 15:42:30.131978 7 clusterinfo.go:164] Inconsistency from 192.168.190.54:6379:

It looks like it is trying to connect to the pods in the error state, but it does not try to delete those pods or spawn new ones. One thing I observed: as soon as I deleted the redis pods stuck in the error state using kubectl delete commands, new pods were spawned and all the pods came back alive.

what is the actual error on the pods?

  • The redis-cluster pod status is Failed and the reason is Terminated.
    Status: Failed
    Reason: Terminated
cin commented

Oh, it sounds like the operator is not recognizing the failed state and is just waiting for the pods to "fix themselves" (which they won't in this case because the pods aren't managed by any kind of ReplicaSet -- they're managed by the operator). I'll see if I can reproduce locally w/kind.

Hey @cin, did you get a chance to reproduce the issue?

cin commented

Sorry I got pulled into some other things yesterday and didn't get a chance to test. Will make some time today.

cin commented

@TANISH-18, I was able to try this in one of our clusters (there's no good way to "restart" nodes in kind). I just rebooted all 3 nodes in the cluster at once. The pods all restarted and came back up fine. I think the main difference from what I just did is that my pods were still there and were just restarted. I'm guessing your pods were recreated? Here's how my pods look after the reboot.

op-operator-for-redis-65d5f6fc78-qlgtg       1/1     Running   1 (8m22s ago)   27m   172.30.142.155   10.209.206.175
rc-node-for-redis-metrics-586bbb87c8-dtz42   1/1     Running   1 (8m22s ago)   26m   172.30.142.149   10.209.206.175
rediscluster-rc-node-for-redis-5gp9t         2/2     Running   2 (8m22s ago)   25m   172.30.142.152   10.209.206.175
rediscluster-rc-node-for-redis-cf479         2/2     Running   2 (8m46s ago)   26m   172.30.45.133    10.185.151.142
rediscluster-rc-node-for-redis-r94nt         2/2     Running   2 (8m35s ago)   26m   172.30.121.8     10.38.252.103

Did your operator or redis pods ever restart? Am I correct in thinking you got your cluster back to a healthy state by deleting the errored-out pods? The prospect of automating that is a bit scary -- not because it's hard but because it's deleting resources. Since you've effectively lost all cache at that point, you can just reinstall the CR as well (easier than deleting all the pods). That's not a great answer if you want to go autopilot on disaster recovery though. In the meantime, I'll see if I can get an OpenShift cluster approved.

@cin yes, my operator pod restarted. Deleting all the redis pods will not reproduce this issue; in that case my redis-cluster pods came back too. The issue is with deleting all the k8s nodes from OpenStack and then starting them again after 5-10 mins -- basically the power-outage case.

Anyway, I have a fix. During reconciliation the operator tries to connect to the pods in the error state, so we need to delete failed pods while polling for failed Redis pods. Once all the pods in the Failed state are deleted, the redis-cluster pods come back.

cin commented

UPDATE: I went a step further and deleted the only worker pool in my test cluster. The redis node pods completely went away as expected. The operator and metrics pods went into a Pending state (also expected). I then recreated a new worker pool and everything came back w/out incident. I'm starting to think this is an OpenShift thing only. I wonder what's different...

cin commented

@TANISH-18 you may want to try out the latest version of the operator as #84 may have resolved this issue as well.