Etcd cluster pod in CrashLoopBackOff
timfpark opened this issue · 4 comments
I've been running a Stolon cluster for about a week (very successfully), but today I noticed that I have lost an etcd pod completely and another is in a CrashLoopBackOff cycle:
NAMESPACE NAME READY STATUS RESTARTS AGE
postgresql postgresql-etcd-0 0/1 CrashLoopBackOff 159 13h
postgresql postgresql-etcd-2 1/1 Running 0 1d
postgresql postgresql-stolon-keeper-0 1/1 Running 0 13h
postgresql postgresql-stolon-keeper-1 1/1 Running 0 1d
postgresql postgresql-stolon-keeper-2 1/1 Running 0 1d
postgresql postgresql-stolon-proxy-3377369672-4vq28 0/1 Running 0 13h
postgresql postgresql-stolon-proxy-3377369672-5jsd5 0/1 Running 0 13h
postgresql postgresql-stolon-proxy-3377369672-qrxm6 0/1 Running 0 1d
postgresql postgresql-stolon-sentinel-2884560845-fwc9w 1/1 Running 0 1d
postgresql postgresql-stolon-sentinel-2884560845-r34nv 1/1 Running 0 13h
postgresql postgresql-stolon-sentinel-2884560845-wgp4q 1/1 Running 0 1d
The logs for postgresql-etcd-0 are the following:
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory
Have you seen this before? Is there any way to manually restart the etcd portion of the cluster easily?
Hi Tim,
Unfortunately yes.
I mentioned it in the readme for this chart and created an issue in kubernetes/charts#685
I didn't try it, but theoretically it should be possible to manually delete the lost etcd members from the cluster and then scale it back up.
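Untested, but a sketch of that approach could look roughly like the commands below. It assumes the pod and namespace names from your listing above, that etcdctl is available inside the surviving etcd pods, that the StatefulSet is named postgresql-etcd, and that the cluster originally had 3 replicas:

# list the current members from a healthy pod to find the ID of the lost member
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member list
# remove the dead member by the hex ID printed above (placeholder here)
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member remove <member-id>
# scale the StatefulSet back up so a replacement member can re-join
kubectl scale statefulset postgresql-etcd -n postgresql --replicas=3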
Thanks for your answer and sorry for missing it in the README
For anyone coming here in the future: you can use this config to create the etcd cluster instead (tested only on GKE). The pull request in kubernetes/charts#685 didn't work for me on GKE.
Here is a workaround that avoids losing your data or recreating your whole cluster:
Use Helm to scale your cluster down by one node:
helm upgrade etcd incubator/etcd --set replicas=2
Wait a few minutes and all nodes will do a rolling restart.
Scale it back up and voila :)
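For the scale-up step, assuming your cluster originally had three replicas and the release is named etcd as in the command above, it would look like:

helm upgrade etcd incubator/etcd --set replicas=3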