lwolf/stolon-chart

Etcd cluster stuck in CrashLoopBackOff

timfpark opened this issue · 4 comments

I've been running a Stolon cluster for about a week (very successfully), but today I noticed that I have lost an etcd pod completely and another is in a CrashLoopBackOff cycle:

postgresql               postgresql-etcd-0                                  0/1       CrashLoopBackOff   159        13h
postgresql               postgresql-etcd-2                                  1/1       Running            0          1d
postgresql               postgresql-stolon-keeper-0                         1/1       Running            0          13h
postgresql               postgresql-stolon-keeper-1                         1/1       Running            0          1d
postgresql               postgresql-stolon-keeper-2                         1/1       Running            0          1d
postgresql               postgresql-stolon-proxy-3377369672-4vq28           0/1       Running            0          13h
postgresql               postgresql-stolon-proxy-3377369672-5jsd5           0/1       Running            0          13h
postgresql               postgresql-stolon-proxy-3377369672-qrxm6           0/1       Running            0          1d
postgresql               postgresql-stolon-sentinel-2884560845-fwc9w        1/1       Running            0          1d
postgresql               postgresql-stolon-sentinel-2884560845-r34nv        1/1       Running            0          13h
postgresql               postgresql-stolon-sentinel-2884560845-wgp4q        1/1       Running            0          1d

The logs for postgresql-etcd-0 are the following:

Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

Have you seen this before? Is there any way to manually restart the etcd portion of the cluster easily?

lwolf commented

Hi Tim,
Unfortunately yes.
I mentioned it in the readme for this chart and created an issue in kubernetes/charts#685

I haven't tried it, but in theory it should be possible to manually delete the lost etcd members from the cluster and then scale it back up.
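For reference, a rough sketch of that manual recovery, assuming the StatefulSet is named postgresql-etcd, postgresql-etcd-2 is still healthy (both guessed from the pod listing above), and the original replica count was 3; the member ID is a placeholder:

# List members from a healthy etcd pod to find the ID of the dead member
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member list
# Remove the dead member (replace <dead-member-id> with the ID reported above)
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member remove <dead-member-id>
# Scale the etcd StatefulSet back up so a fresh member can rejoin (3 replicas is an assumption)
kubectl scale statefulset postgresql-etcd -n postgresql --replicas=3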

Thanks for your answer, and sorry for missing it in the README.

For anyone coming here in the future: you can use this config to create the etcd cluster instead (tested only on GKE). The pull request in kubernetes/charts#685 didn't work for me on GKE.

A workaround for this that avoids losing your data or recreating the whole cluster:
Use helm to scale down your cluster by 1 node
helm upgrade etcd incubator/etcd --set replicas=2
Wait a few minutes and all nodes will do a rolling restart.
Scale it back up (see below), and voilà :)
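To scale back up, a minimal sketch assuming the original replica count was 3 and the same release name and chart as in the step above:

# Restore the original replica count (3 is an assumption) and let the rolling restart finish
helm upgrade etcd incubator/etcd --set replicas=3
# Confirm all etcd pods come back to Running
kubectl get pods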