lwolf/stolon-chart

Etcd cluster stuck in CrashLoopBackOff

timfpark opened this issue · 4 comments

I've been running a Stolon cluster for about a week (very successfully), but today I noticed that I have lost an etcd pod completely and another is in a CrashLoopBackOff cycle:

postgresql               postgresql-etcd-0                                  0/1       CrashLoopBackOff   159        13h
postgresql               postgresql-etcd-2                                  1/1       Running            0          1d
postgresql               postgresql-stolon-keeper-0                         1/1       Running            0          13h
postgresql               postgresql-stolon-keeper-1                         1/1       Running            0          1d
postgresql               postgresql-stolon-keeper-2                         1/1       Running            0          1d
postgresql               postgresql-stolon-proxy-3377369672-4vq28           0/1       Running            0          13h
postgresql               postgresql-stolon-proxy-3377369672-5jsd5           0/1       Running            0          13h
postgresql               postgresql-stolon-proxy-3377369672-qrxm6           0/1       Running            0          1d
postgresql               postgresql-stolon-sentinel-2884560845-fwc9w        1/1       Running            0          1d
postgresql               postgresql-stolon-sentinel-2884560845-r34nv        1/1       Running            0          13h
postgresql               postgresql-stolon-sentinel-2884560845-wgp4q        1/1       Running            0          1d

The logs for postgresql-etcd-0 are the following:

Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

Have you seen this before? Is there any way to manually restart the etcd portion of the cluster easily?

lwolf commented

Hi Tim,
Unfortunately yes.
I mentioned it in the readme for this chart and created an issue in kubernetes/charts#685

I haven't tried it, but in theory it should be possible to manually delete the lost etcd members from the cluster and then scale it back up.
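For reference, a rough sketch of that manual recovery, assuming the StatefulSet is named postgresql-etcd, postgresql-etcd-2 is still healthy (both guessed from the pod listing above), and the original replica count was 3; the member ID is a placeholder:

# List members from a healthy etcd pod to find the ID of the dead member
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member list
# Remove the dead member (replace <dead-member-id> with the ID reported above)
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member remove <dead-member-id>
# Scale the etcd StatefulSet back up so a fresh member can rejoin (3 replicas is an assumption)
kubectl scale statefulset postgresql-etcd -n postgresql --replicas=3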

Thanks for your answer, and sorry for missing it in the README.

For anyone coming here in the future: you can use this config to create the etcd cluster instead (tested only on GKE). The pull request in kubernetes/charts#685 didn't work for me on GKE.

A workaround for this that avoids losing your data or recreating the whole cluster:
Use helm to scale down your cluster by 1 node
helm upgrade etcd incubator/etcd --set replicas=2
Wait a few minutes and all nodes will do a rolling restart.
Scale it back up (see below), and voilà :)
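To scale back up, a minimal sketch assuming the original replica count was 3 and the same release name and chart as in the step above:

# Restore the original replica count (3 is an assumption) and let the rolling restart finish
helm upgrade etcd incubator/etcd --set replicas=3
# Confirm all etcd pods come back to Running
kubectl get pods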