timoha/hbase-k8s-operator

deleting the HBase master statefulset leads to the operator getting stuck


tsuna commented

If the hbasemaster STS gets deleted, the operator doesn't recreate it and instead remains stuck, logging this forever:

time="2022-01-04T08:59:21Z" level=error msg="failed looking up region" backoff=1h32m13.192s err="failed to read the /hbase/master znode: zk: node does not exist" key="\"\"" table="\"\""
time="2022-01-04T10:31:34Z" level=error msg="failed looking up region" backoff=1h32m18.192s err="failed to read the /hbase/master znode: zk: node does not exist" key="\"\"" table="\"\""

Restarting the hbase-k8s-operator-controller-manager pod works around the issue, because the operator recreates the STS when it starts back up. But the operator should recreate the STS on its own whenever it goes away, without needing a restart.
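To illustrate the requested behavior, here is a minimal sketch of an "ensure it exists" step that could run on every reconcile pass. This is not the operator's actual code: `Store`, `memStore`, `ErrNotFound`, and `EnsureExists` are hypothetical names, and the in-memory map only stands in for the Kubernetes API so the example is runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for the API server's "not found" error.
var ErrNotFound = errors.New("not found")

// Store is a hypothetical abstraction over the Kubernetes API,
// not the operator's real client.
type Store interface {
	Get(name string) (string, error)
	Create(name, spec string) error
}

// memStore is an in-memory Store used only to make the sketch runnable.
type memStore map[string]string

func (m memStore) Get(name string) (string, error) {
	spec, ok := m[name]
	if !ok {
		return "", ErrNotFound
	}
	return spec, nil
}

func (m memStore) Create(name, spec string) error {
	m[name] = spec
	return nil
}

// EnsureExists creates the object from desiredSpec when it is missing
// and reports whether it had to create it. Any error other than
// ErrNotFound is returned unchanged so the caller can retry later.
func EnsureExists(s Store, name, desiredSpec string) (bool, error) {
	_, err := s.Get(name)
	if err == nil {
		return false, nil // already present, nothing to do
	}
	if !errors.Is(err, ErrNotFound) {
		return false, err
	}
	if err := s.Create(name, desiredSpec); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	s := memStore{}

	// First pass: the STS is missing, so it gets created.
	created, err := EnsureExists(s, "hbasemaster", "replicas=2")
	fmt.Println(created, err)

	// Simulate someone deleting the STS, then reconcile again.
	delete(s, "hbasemaster")
	created, err = EnsureExists(s, "hbasemaster", "replicas=2")
	fmt.Println(created, err)
}
```

The point of the sketch is that the existence check happens on every pass rather than only at startup, so a deleted STS would be restored without restarting the controller pod.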

I think this is one of the limitations as listed:

Operates only on healthy clusters, manual intervention is required in case of issues (it's the job of HBase to recover from failures)

So this would be a feature request to recover from an unhealthy cluster state.

If I remember correctly, this was actually a limitation of GoHBase, which doesn't properly handle HBase masters going away (there was never a good use case for keeping a persistent connection to the master).
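One general way a client can cope with a master going away is to drop its cached master address after a failed call and resolve it again on the next attempt. The sketch below only illustrates that idea and does not use GoHBase's API: `lookupFunc`, `masterClient`, `Do`, and `errDown` are hypothetical names, and the addresses are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// errDown is a hypothetical error simulating an unreachable master.
var errDown = errors.New("connection refused")

// lookupFunc is a hypothetical resolver, e.g. one that reads the
// master znode from ZooKeeper.
type lookupFunc func() (string, error)

// masterClient caches the master address and forgets it after a failed call.
type masterClient struct {
	lookup lookupFunc
	addr   string // empty means "not resolved yet"
}

// Do resolves the master if needed and runs call against it. On failure
// it drops the cached address so the next attempt re-resolves instead of
// reusing a master that may be gone.
func (c *masterClient) Do(call func(addr string) error) error {
	if c.addr == "" {
		addr, err := c.lookup()
		if err != nil {
			return fmt.Errorf("resolving master: %w", err)
		}
		c.addr = addr
	}
	if err := call(c.addr); err != nil {
		c.addr = ""
		return err
	}
	return nil
}

func main() {
	masters := []string{"master-0:16000", "master-1:16000"}
	lookups := 0
	c := &masterClient{lookup: func() (string, error) {
		addr := masters[lookups%len(masters)]
		lookups++
		return addr, nil
	}}

	// The first call fails, so the cached address is dropped.
	fmt.Println(c.Do(func(addr string) error { return errDown }))

	// The second call re-resolves and sees the next master.
	fmt.Println(c.Do(func(addr string) error {
		fmt.Println("using", addr)
		return nil
	}))
}
```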