lwolf/stolon-chart

stolon chart deployment not working

Closed this issue · 12 comments

Hi all!
I set up a VM with minikube on one of the hosts I manage and tested with simple pods.
I tried the stolon chart: the installation runs smoothly (no errors reported, running with the debug option enabled), but the service pods (proxy, sentinel, keeper) and the psql pods all enter an error state, and psql pods are created repeatedly (I counted over 300 pods when left running for some time).
I tried reducing overall resource consumption by lowering the replica count to 1, the memory request to 256Mi, and the CPU to 50m, but with no success.
I didn't change anything in the chart files except the numbers mentioned above.
Logs are uploaded on gist here: https://gist.github.com/solidiris/362b6e5b29559e0a13680a5ded025d41
Have you ever encountered this problem? Is there something wrong?

Thank you

lwolf commented

Hi, apologies for the late reply.
Did you figure it out?
From the description (you did not change anything) and the result of `get pods`, you should be using etcd as the backend, but it looks like you did not deploy it.

Hello Sergey!
Thank you for looking into a potential issue. As soon as I am working on this again, I will try it and post the results here.

Same behavior.

cannot get cluster data: context deadline exceeded

I did:
git clone ...
helm install --name mine stolon/ --debug

kubectl logs -f mine-stolon-keeper-0
2018-06-08T13:14:15.894Z	WARN	cmd/keeper.go:158	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z	WARN	cmd/keeper.go:158	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z	INFO	cmd/keeper.go:1914	exclusive lock on data dir taken
2018-06-08T13:14:15.896Z	INFO	cmd/keeper.go:486	keeper uid	{"uid": "keeper0"}
2018-06-08T13:14:20.896Z	ERROR	cmd/keeper.go:693	error retrieving cluster data	{"error": "context deadline exceeded"}
2018-06-08T13:14:25.902Z	ERROR	cmd/keeper.go:932	error retrieving cluster data	{"error": "context deadline exceeded"}
[... the same `error retrieving cluster data {"error": "context deadline exceeded"}` line repeats every 10 seconds ...]
kubectl logs -f mine-stolon-sentinel-5d7688f76d-79g6h
2018-06-08T13:14:17.252Z	INFO	cmd/sentinel.go:1873	sentinel uid	{"uid": "a218c47b"}
2018-06-08T13:14:17.252Z	INFO	cmd/sentinel.go:94	Trying to acquire sentinels leadership
2018-06-08T13:14:22.252Z	ERROR	cmd/sentinel.go:1727	error retrieving cluster data	{"error": "context deadline exceeded"}
2018-06-08T13:14:32.253Z	ERROR	cmd/sentinel.go:1727	error retrieving cluster data	{"error": "context deadline exceeded"}

My guess is that this is a problem with the following defaults:

- --store-backend=etcdv3
- --store-endpoints=http://etcd-etcd-0.etcd-etcd:2379,http://etcd-etcd-1.etcd-etcd:2379,http://etcd-etcd-2.etcd-etcd:2379

I used the etcd helm chart from incubator:
helm install --name red incubator/etcd
Then:
kubectl get endpoints -> endpoints: "http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379"
Using those endpoints, it seems to work (i.e. no constant restarts), but I still see similar errors in the logs:

2018-06-08T14:56:03.182Z	INFO	cmd/proxy.go:383	proxy uid	{"uid": "742aa2fb"}
2018-06-08T14:56:08.189Z	INFO	cmd/proxy.go:319	check function error	{"error": "cannot get cluster data: context deadline exceeded"}
2018-06-08T14:56:18.189Z	INFO	cmd/proxy.go:279	check timeout timer fired
2018-06-08T14:56:18.196Z	INFO	cmd/proxy.go:319	check function error	{"error": "cannot get cluster data: context deadline exceeded"}
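The workaround above (pointing the chart at the actually deployed etcd cluster) can also be expressed as a Helm values override instead of editing the manifests by hand. A minimal sketch, assuming the chart exposes these settings under `store.backend` and `store.endpoints` (the key names are an assumption; check the chart's values.yaml):

```yaml
# values-override.yaml -- hypothetical key names, verify against the chart's values.yaml
store:
  backend: etcdv3
  # Endpoints reported by `kubectl get endpoints` for the incubator/etcd release.
  # Note: these are pod IPs, which change when pods restart; pointing at the
  # etcd service DNS name instead would be more robust.
  endpoints: "http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379"
```

Applied with something like `helm install --name mine -f values-override.yaml stolon/`.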
lwolf commented

I updated values.yaml so that kubernetes is now the default store backend, to avoid future problems caused by a missing etcd.
Could you try it and confirm that it works for you?
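For reference, switching the default store away from etcd amounts to something like the following in values.yaml. The key name is an assumption based on stolon's `--store-backend` flag, and the comment reflects that stolon's kubernetes backend keeps cluster data in Kubernetes API objects rather than an external store:

```yaml
store:
  # "kubernetes" stores cluster data via the Kubernetes API,
  # removing the external etcd dependency for test deployments.
  backend: kubernetes
```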

lwolf commented

Most likely your cluster-create job failed for some reason during the initial start.
I've seen a lot of similar problems caused by a misconfigured value. That's why I'm changing the default, so people can start playing with a working version without dealing with an external dependency.

lwolf commented

Yes, that problem exists. It's a bit annoying if you are used to watching pods (as I am), but most people don't care.
This chart used to depend on incubator/etcd, until I found out that it is broken and cannot recover from a restart of any pod (helm/charts#685). The bug is still there, so the dependency was removed.

In terms of simplicity, the kubernetes backend is as simple as a dependent etcd, or even simpler.

In terms of HA, though: if you really care about HA, you should have a separately managed multi-node etcd deployment, with backups, rolling updates, etc., and it should not depend on this chart.
Hope this makes sense.

Works!

lwolf commented

great, thanks for letting me know