stolon chart deployment not working
Closed this issue · 12 comments
Hi all!
I set up a VM with minikube on one of the hosts I manage and tested with simple pods.
I tried the stolon chart: the installation runs smoothly (no errors reported, even with the debug option enabled), but the service pods (proxy, sentinel, keeper) and the psql pods all enter an error state, and psql pods are created repeatedly (I counted over 300 pods after leaving it running for some time).
I tried reducing overall resource consumption by lowering the number of replicas to 1, the memory request to 256Mi, and the CPU request to 50m, but with no success.
I didn't change anything in the chart files except the numbers mentioned above.
Logs are uploaded on gist here: https://gist.github.com/solidiris/362b6e5b29559e0a13680a5ded025d41
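For reference, overrides like the ones above can be applied at install time along these lines (the value keys `keeper.replicaCount` and `resources.*` are assumptions, not taken from the chart; check values.yaml for the real names):

```shell
# Sketch only: value keys (keeper.replicaCount, resources.*) are assumed,
# not confirmed against the chart -- check values.yaml for the exact names.
helm install --name mine stolon/ --debug \
  --set keeper.replicaCount=1 \
  --set resources.requests.memory=256Mi \
  --set resources.requests.cpu=50m
```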
Have you ever run into the same problem? Is there something wrong?
Thank you
Hi, apologies for the late reply.
Did you figure it out?
From your description (you did not change anything) and the output of `kubectl get pods`, you should be using etcd as the backend, but it looks like you did not deploy it.
Hello Sergey!
Thank you for pointing out a potential issue. As soon as I am working on this again, I will try it and post the results here.
Same behavior.
cannot get cluster data: context deadline exceeded
I did:
git clone ...
helm install --name mine stolon/ --debug
kubectl logs -f mine-stolon-keeper-0
2018-06-08T13:14:15.894Z WARN cmd/keeper.go:158 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z WARN cmd/keeper.go:158 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z INFO cmd/keeper.go:1914 exclusive lock on data dir taken
2018-06-08T13:14:15.896Z INFO cmd/keeper.go:486 keeper uid {"uid": "keeper0"}
2018-06-08T13:14:20.896Z ERROR cmd/keeper.go:693 error retrieving cluster data {"error": "context deadline exceeded"}
2018-06-08T13:14:25.902Z ERROR cmd/keeper.go:932 error retrieving cluster data {"error": "context deadline exceeded"}
2018-06-08T13:14:35.902Z ERROR cmd/keeper.go:932 error retrieving cluster data {"error": "context deadline exceeded"}
[... the same error repeated every 10 seconds, up to 13:19:25 ...]
kubectl logs -f mine-stolon-sentinel-5d7688f76d-79g6h
2018-06-08T13:14:17.252Z INFO cmd/sentinel.go:1873 sentinel uid {"uid": "a218c47b"}
2018-06-08T13:14:17.252Z INFO cmd/sentinel.go:94 Trying to acquire sentinels leadership
2018-06-08T13:14:22.252Z ERROR cmd/sentinel.go:1727 error retrieving cluster data {"error": "context deadline exceeded"}
2018-06-08T13:14:32.253Z ERROR cmd/sentinel.go:1727 error retrieving cluster data {"error": "context deadline exceeded"}
My guess is that this is a problem with the following defaults:
- --store-backend=etcdv3
- --store-endpoints=http://etcd-etcd-0.etcd-etcd:2379,http://etcd-etcd-1.etcd-etcd:2379,http://etcd-etcd-2.etcd-etcd:2379
I used the etcd helm chart from incubator:
helm install --name red incubator/etcd
Then
kubectl get endpoints
-> endpoints: "http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379"
Then, using those endpoints, it seems to work (i.e. no constant restarts), but I still see similar errors in the logs:
2018-06-08T14:56:03.182Z INFO cmd/proxy.go:383 proxy uid {"uid": "742aa2fb"}
2018-06-08T14:56:08.189Z INFO cmd/proxy.go:319 check function error {"error": "cannot get cluster data: context deadline exceeded"}
2018-06-08T14:56:18.189Z INFO cmd/proxy.go:279 check timeout timer fired
2018-06-08T14:56:18.196Z INFO cmd/proxy.go:319 check function error {"error": "cannot get cluster data: context deadline exceeded"}
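The endpoint substitution above can be scripted along these lines (the service name `red-etcd` and the value key `store.endpoints` are assumptions based on the release name and the chart flags; adjust to your setup):

```shell
# Sketch: collect the etcd endpoint IPs and pass them to the stolon chart.
# "red-etcd" follows from "helm install --name red incubator/etcd" (assumed name).
ENDPOINTS=$(kubectl get endpoints red-etcd \
  -o jsonpath='{range .subsets[*].addresses[*]}http://{.ip}:2379,{end}' \
  | sed 's/,$//')
helm install --name mine stolon/ --set store.endpoints="$ENDPOINTS"
```

Note that raw pod IPs change on restart, so this is only a workaround for testing; a stable service DNS name would be preferable.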
I updated values.yaml so that kubernetes is now the default store backend, to avoid future problems for people who don't have etcd deployed.
Could you try it and confirm that it works for you?
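To select the backend explicitly rather than relying on the default, something like this should work (the value key `store.backend` is an assumption based on the `--store-backend` flag shown earlier; check values.yaml):

```shell
# Sketch: explicitly select the kubernetes store backend
# (value key assumed; with the updated values.yaml this is the default).
helm install --name mine stolon/ --set store.backend=kubernetes
```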
Most likely your cluster-create job failed for some reason during the initial start.
I've seen a lot of similar problems caused by a single misconfigured value. That's why I'm changing the default, so people can start playing with a working version without dealing with an external dependency.
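One way to check whether the create job failed is to inspect it directly (the job name below is a guess based on the release name "mine"; list the jobs first to find the real one):

```shell
# Sketch: find and inspect the chart's cluster-create job from the initial install.
kubectl get jobs
kubectl logs job/mine-stolon-create-cluster   # name assumed; use the name from 'get jobs'
```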
Yes, that problem exists. It's a bit annoying if you tend to watch pods (as I do), but most people don't care.
This chart used to depend on incubator/etcd, until I found out that that chart is broken and cannot recover from a restart of any of its pods (helm/charts#685). The bug is still there, so the dependency was removed.
In terms of simplicity, the kubernetes backend is as simple as a bundled etcd dependency, or even simpler.
In terms of HA, though, if you really care about HA you should have a separately managed multi-node etcd deployment, with backups, rolling updates, etc., and it should not depend on this chart.
Hope this makes sense.
Works!
great, thanks for letting me know