lwolf/stolon-chart

stolon chart deployment not working

Closed this issue · 12 comments

Hi all!
I set up a VM with minikube on one of the hosts I manage and tested with simple pods.
I tried the stolon chart: the installation runs smoothly (no errors reported, running with the debug option enabled), but the service pods (proxy, sentinel, keeper) and the psql pods all enter an error state, and psql pods are created repeatedly (I counted over 300 pods when left running for some time).
I tried reducing overall resource consumption by lowering the replica count to 1, the memory request to 256Mi, and the CPU to 50m, but with no success.
I didn't change anything in the chart files except the numbers mentioned above.
Logs are uploaded on gist here: https://gist.github.com/solidiris/362b6e5b29559e0a13680a5ded025d41
Have you ever encountered this problem? Is there something wrong?

Thank you

lwolf commented

Hi, apologies for the late reply.
Did you figure it out?
From the description (you did not change anything) and the result of `get pods`, you should be using etcd as the backend, but it looks like you did not deploy it.

Hello Sergey!
Thank you for looking into a potential issue. As soon as I am working on this again, I will try it and post the results here.

Same behavior.

cannot get cluster data: context deadline exceeded

I did:
git clone ...
helm install --name mine stolon/ --debug

kubectl logs -f mine-stolon-keeper-0
2018-06-08T13:14:15.894Z	WARN	cmd/keeper.go:158	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z	WARN	cmd/keeper.go:158	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z	INFO	cmd/keeper.go:1914	exclusive lock on data dir taken
2018-06-08T13:14:15.896Z	INFO	cmd/keeper.go:486	keeper uid	{"uid": "keeper0"}
2018-06-08T13:14:20.896Z	ERROR	cmd/keeper.go:693	error retrieving cluster data	{"error": "context deadline exceeded"}
2018-06-08T13:14:25.902Z	ERROR	cmd/keeper.go:932	error retrieving cluster data	{"error": "context deadline exceeded"}
[... the same `error retrieving cluster data {"error": "context deadline exceeded"}` line repeats every 10 seconds ...]
kubectl logs -f mine-stolon-sentinel-5d7688f76d-79g6h
2018-06-08T13:14:17.252Z	INFO	cmd/sentinel.go:1873	sentinel uid	{"uid": "a218c47b"}
2018-06-08T13:14:17.252Z	INFO	cmd/sentinel.go:94	Trying to acquire sentinels leadership
2018-06-08T13:14:22.252Z	ERROR	cmd/sentinel.go:1727	error retrieving cluster data	{"error": "context deadline exceeded"}
2018-06-08T13:14:32.253Z	ERROR	cmd/sentinel.go:1727	error retrieving cluster data	{"error": "context deadline exceeded"}

My guess is that this is a problem with the following defaults:

- --store-backend=etcdv3
- --store-endpoints=http://etcd-etcd-0.etcd-etcd:2379,http://etcd-etcd-1.etcd-etcd:2379,http://etcd-etcd-2.etcd-etcd:2379

I used the etcd helm chart from incubator:
helm install --name red incubator/etcd
Then:
kubectl get endpoints -> endpoints: "http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379"
Using those endpoints, it seems to work (i.e. no constant restarts), but I still see similar errors in the logs:

2018-06-08T14:56:03.182Z	INFO	cmd/proxy.go:383	proxy uid	{"uid": "742aa2fb"}
2018-06-08T14:56:08.189Z	INFO	cmd/proxy.go:319	check function error	{"error": "cannot get cluster data: context deadline exceeded"}
2018-06-08T14:56:18.189Z	INFO	cmd/proxy.go:279	check timeout timer fired
2018-06-08T14:56:18.196Z	INFO	cmd/proxy.go:319	check function error	{"error": "cannot get cluster data: context deadline exceeded"}
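The workaround above (pointing the chart at the actually deployed etcd cluster) can also be expressed as a Helm values override instead of editing the manifests by hand. A minimal sketch, assuming the chart exposes these settings under `store.backend` and `store.endpoints` (the key names are an assumption; check the chart's values.yaml):

```yaml
# values-override.yaml -- hypothetical key names, verify against the chart's values.yaml
store:
  backend: etcdv3
  # Endpoints reported by `kubectl get endpoints` for the incubator/etcd release.
  # Note: these are pod IPs, which change when pods restart; pointing at the
  # etcd service DNS name instead would be more robust.
  endpoints: "http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379"
```

Applied with something like `helm install --name mine -f values-override.yaml stolon/`.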
lwolf commented

I updated values.yaml so that kubernetes is now the default store backend, to avoid future problems caused by a missing etcd.
Could you try it and confirm that it works for you?
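For reference, switching the default store away from etcd amounts to something like the following in values.yaml. The key name is an assumption based on stolon's `--store-backend` flag, and the comment reflects that stolon's kubernetes backend keeps cluster data in Kubernetes API objects rather than an external store:

```yaml
store:
  # "kubernetes" stores cluster data via the Kubernetes API,
  # removing the external etcd dependency for test deployments.
  backend: kubernetes
```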

lwolf commented

Most likely your cluster-create job failed for some reason during the initial start.
I've seen a lot of similar problems caused by a misconfigured value. That's why I'm changing the default, so people can start playing with a working version without dealing with an external dependency.

lwolf commented

Yes, that problem exists. It's a bit annoying if you are used to watching pods (as I am), but most people don't care.
This chart used to depend on incubator/etcd, until I found out that it is broken and cannot recover from a restart of any pod (helm/charts#685). The bug is still there, so the dependency was removed.

In terms of simplicity, the kubernetes backend is as simple as a dependent etcd, or even simpler.

In terms of HA, though: if you really care about HA, you should have a separately managed multi-node etcd deployment, with backups, rolling updates, etc., and it should not depend on this chart.
Hope this makes sense.

Works!

lwolf commented

great, thanks for letting me know