openshift/cluster-etcd-operator

clustermembercontroller redesign

alaypatel07 opened this issue · 2 comments

It is hard to reason about how the state transitions happen today. The workflow from no cluster to bootstrap complete is well understood, but we need to start thinking about how the operator will respond to a simple static pod restart (for instance during an upgrade).
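
As a starting point for that discussion, here is a minimal sketch of what explicit, reviewable state transitions could look like. The `ClusterState` type, the individual states, and the transition table are hypothetical names for illustration only, not the operator's current code.

```go
// Sketch only: illustrates making clustermembercontroller's state
// transitions explicit so reviewers can reason about them.
package main

import "fmt"

type ClusterState string

const (
	StateBootstrapping    ClusterState = "Bootstrapping"    // bootstrap member still present
	StateScaling          ClusterState = "Scaling"          // adding/removing members
	StateSteady           ClusterState = "Steady"           // all members healthy
	StateMemberRestarting ClusterState = "MemberRestarting" // e.g. static pod restart during an upgrade
)

// allowedTransitions is the reviewable transition table: every valid path
// the controller may take is written down in one place.
var allowedTransitions = map[ClusterState][]ClusterState{
	StateBootstrapping:    {StateScaling},
	StateScaling:          {StateSteady},
	StateSteady:           {StateScaling, StateMemberRestarting},
	StateMemberRestarting: {StateSteady},
}

func canTransition(from, to ClusterState) bool {
	for _, s := range allowedTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StateSteady, StateMemberRestarting)) // true
	fmt.Println(canTransition(StateBootstrapping, StateSteady))    // false: must scale first
}
```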

A good path forward would be a design doc to brainstorm ideas and viable options for what the state transitions should look like.

Example of a hard-to-understand upgrade run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/68/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade/181/artifacts/e2e-gcp-upgrade/pods/openshift-etcd-operator_etcd-operator-9c6dc968f-jqpnv_operator.log

In that log it can be seen that the etcd members are running, yet interim errors like the following still appear:

{"level":"warn","ts":"2020-01-30T15:01:02.044Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4faa3884-4fba-4b60-aa9d-1efcc12d44ea/etcd-0.ci-op-nzflhv64-e4498.origin-ci-int-gce.dev.openshift.com:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
E0130 15:01:02.045082       1 clustermembercontroller.go:500] key failed with : etcdserver: unhealthy cluster
I0130 15:01:02.045103       1 controlbuf.go:430] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0130 15:01:02.045264       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"fe123f49-14ee-452f-8257-e7ed8d774511", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ScalingFailed' etcdserver: unhealthy cluster

Errors like these kick the operator into a Degraded state even though only one member is restarting.
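
A minimal sketch, assuming the controller can see per-member health, of one way to avoid that: only report Degraded when quorum is actually lost, so a single restarting member during an upgrade does not immediately produce a ScalingFailed event. The `memberHealth` type and `shouldReportDegraded` function are hypothetical and not part of the current operator.

```go
// Sketch only: tolerate a single restarting member as long as quorum holds.
package main

import "fmt"

type memberHealth struct {
	name    string
	healthy bool
}

// shouldReportDegraded returns true only when the number of healthy members
// drops below quorum, instead of degrading on any transient member failure.
func shouldReportDegraded(members []memberHealth) bool {
	healthy := 0
	for _, m := range members {
		if m.healthy {
			healthy++
		}
	}
	quorum := len(members)/2 + 1
	return healthy < quorum
}

func main() {
	members := []memberHealth{
		{name: "etcd-0", healthy: false}, // restarting during the upgrade
		{name: "etcd-1", healthy: true},
		{name: "etcd-2", healthy: true},
	}
	fmt.Println(shouldReportDegraded(members)) // false: quorum (2 of 3) is intact
}
```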

/assign