Data node crash loop
ivanovaleksandar opened this issue · 7 comments
The data node is constantly restarting in the initialization phase after working well for a few days.
kubectl logs es-data-0 -f
[2018-07-09T07:53:35,420][INFO ][o.e.n.Node ] [es-data-0] initializing ...
[2018-07-09T07:53:35,572][INFO ][o.e.e.NodeEnvironment ] [es-data-0] using [1] data paths, mounts [[/data (<ip-addr>:vol_016257cd4643b18711209d769f93e979)]], net usable_space [1.9gb], net total_space [2.9gb], types [fuse.glusterfs]
[2018-07-09T07:53:35,573][INFO ][o.e.e.NodeEnvironment ] [es-data-0] heap size [2.9gb], compressed ordinary object pointers [true]
After this it crashes and the initialization starts all over again.
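For crash loops like this, the logs of the previous (killed) container instance can also be fetched with
kubectl logs es-data-0 --previous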
This is the YAML that I am using. I've set up requests and limits accordingly (the Java heap is smaller than the pod limits, as people suggested), but to no avail.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: es-data
  labels:
    component: elasticsearch
    role: data
spec:
  serviceName: elasticsearch-data
  replicas: 2
  template:
    metadata:
      labels:
        component: elasticsearch
        role: data
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox:1.27.2
        command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        securityContext:
          privileged: true
      containers:
      - name: es-data
        image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.0
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CLUSTER_NAME
          value: myesdb
        - name: NODE_MASTER
          value: "false"
        - name: NODE_INGEST
          value: "false"
        - name: HTTP_ENABLE
          value: "true"
        - name: ES_JAVA_OPTS
          value: -Xms3g -Xmx3g
        - name: PROCESSORS
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
        resources:
          requests:
            memory: 3Gi
          limits:
            cpu: 2
            memory: 4Gi
        ports:
        - containerPort: 9200
          name: http
        - containerPort: 9300
          name: transport
        livenessProbe:
          tcpSocket:
            port: transport
          initialDelaySeconds: 20
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /_cluster/health
            port: http
          initialDelaySeconds: 20
          timeoutSeconds: 5
        volumeMounts:
        - name: storage
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      storageClassName: gluster-heketi-external
      accessModes: [ ReadWriteOnce ]
      resources:
        requests:
          storage: 3Gi
Any ideas or suggestions?
Does kubectl describe po/es-data-0
show any kind of timeouts or issues during startup observed by Kubernetes?
Eventually the pod gets killed because it does not start up fast enough to answer the livenessProbe.
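One quick way to confirm whether the kubelet is the one killing the container is to check its last terminated state (a one-liner sketch, using the pod name from your manifest):
kubectl get pod es-data-0 -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
An exit code of 143 (SIGTERM) or 137 (SIGKILL) there usually points to the kubelet or the OOM killer rather than an Elasticsearch crash.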
Additionally, it's not a good idea to allocate almost all of the available memory to the Java heap:
[...]
- name: ES_JAVA_OPTS
  value: -Xms3g -Xmx3g
[...]

vs.

[...]
resources:
  requests:
    memory: 3Gi
  limits:
    cpu: 2
    memory: 4Gi
[...]
Per the Elastic documentation:
Set Xmx to no more than 50% of your physical RAM, to ensure that there is enough physical RAM left for kernel file system caches.
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
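Applied to this manifest, with the 4Gi memory limit left as-is, the 50% rule would look roughly like this (a sketch; exact values are illustrative):
[...]
- name: ES_JAVA_OPTS
  value: -Xms2g -Xmx2g   # ~50% of the 4Gi container memory limit
[...]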
I've changed ES_JAVA_OPTS to be 50% of the limits (and tried different variations of those parameters), but I still get the same crash loop.
Also, I don't see anything strange in the events log, besides the readiness probe failure when trying to reach the _cluster/health
endpoint, but that is because the init process never gets to the point of starting the service successfully.
kubectl describe pod es-data-0
Name: es-data-0
Namespace: default
Node: kubernetes-node1/<ip-addr>
Start Time: Mon, 09 Jul 2018 10:30:02 +0200
Labels: component=elasticsearch
controller-revision-hash=es-data-776697d896
role=data
statefulset.kubernetes.io/pod-name=es-data-0
Annotations: <none>
Status: Running
IP: 10.44.0.0
Controlled By: StatefulSet/es-data
Init Containers:
init-sysctl:
Container ID: docker://320600efc4f4e2450933de60300b04b62fc442b422f55db0636c42ace9750115
Image: busybox:1.27.2
Image ID: docker-pullable://busybox@sha256:bbc3a03235220b170ba48a157dd097dd1379299370e1ed99ce976df0355d24f0
Port: <none>
Host Port: <none>
Command:
sysctl
-w
vm.max_map_count=262144
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Jul 2018 10:30:03 +0200
Finished: Mon, 09 Jul 2018 10:30:03 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-875d5 (ro)
Containers:
es-data:
Container ID: docker://0ab42eec948bfb61180ff55d9c431e9cbc7afb719fe47a2b76c716e6f0a726cc
Image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.0
Image ID: docker-pullable://quay.io/pires/docker-elasticsearch-kubernetes@sha256:dcd3e9db3d2c6b9a448d135aebcacac30a4cca655d42efaa115aa57405cd22f3
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Mon, 09 Jul 2018 10:30:49 +0200
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Mon, 09 Jul 2018 10:30:05 +0200
Finished: Mon, 09 Jul 2018 10:30:49 +0200
Ready: False
Restart Count: 1
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 2
memory: 2Gi
Liveness: tcp-socket :transport delay=20s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http/_cluster/health delay=20s timeout=5s period=10s #success=1 #failure=3
Environment:
NAMESPACE: default (v1:metadata.namespace)
NODE_NAME: es-data-0 (v1:metadata.name)
CLUSTER_NAME: myesdb
NODE_MASTER: false
NODE_INGEST: false
HTTP_ENABLE: true
ES_JAVA_OPTS: -Xms2g -Xmx2g
PROCESSORS: 2 (limits.cpu)
Mounts:
/data from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-875d5 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: storage-es-data-0
ReadOnly: false
default-token-875d5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-875d5
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned es-data-0 to kubernetes-node1
Normal SuccessfulMountVolume 1m kubelet, kubernetes-node1 MountVolume.SetUp succeeded for volume "default-token-875d5"
Normal SuccessfulMountVolume 1m kubelet, kubernetes-node1 MountVolume.SetUp succeeded for volume "pvc-306d82b3-805e-11e8-ac5c-7625bd182864"
Normal Pulled 59s kubelet, kubernetes-node1 Container image "busybox:1.27.2" already present on machine
Normal Created 59s kubelet, kubernetes-node1 Created container
Normal Started 59s kubelet, kubernetes-node1 Started container
Warning Unhealthy 19s (x2 over 29s) kubelet, kubernetes-node1 Readiness probe failed: Get http://10.44.0.0:9200/_cluster/health: dial tcp 10.44.0.0:9200: getsockopt: connection refused
Warning Unhealthy 14s (x3 over 34s) kubelet, kubernetes-node1 Liveness probe failed: dial tcp 10.44.0.0:9300: getsockopt: connection refused
Normal Pulled 13s (x2 over 57s) kubelet, kubernetes-node1 Container image "quay.io/pires/docker-elasticsearch-kubernetes:6.3.0" already present on machine
Normal Created 13s (x2 over 57s) kubelet, kubernetes-node1 Created container
Normal Started 13s (x2 over 57s) kubelet, kubernetes-node1 Started container
Normal Killing 13s kubelet, kubernetes-node1 Killing container with id docker://es-data:Container failed liveness probe.. Container will be killed and recreated.
But you can see that, as I already assumed, the liveness
probe is failing and is therefore killing the container :-/
Warning Unhealthy 14s (x3 over 34s) kubelet, kubernetes-node1 Liveness probe failed: dial tcp 10.44.0.0:9300: getsockopt: connection refused
Normal Killing 13s kubelet, kubernetes-node1 Killing container with id docker://es-data:Container failed liveness probe.. Container will be killed and recreated.
Could you try increasing the initialDelaySeconds
of the livenessProbe to something like 2 or 5 minutes to see if it comes up then? Afterwards you can reduce the delay to a value closer to the actual startup time.
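For example, a minimal sketch of the adjusted probe (300 seconds is deliberately generous and can be tuned down once you know the real startup time):
livenessProbe:
  tcpSocket:
    port: transport
  initialDelaySeconds: 300   # generous on purpose; reduce after measuring startup
  periodSeconds: 10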
Hey, that worked. :)
That was a trivial oversight on my side. But as the cluster takes in data to process and starts up its services, the readiness/liveness probes will probably need to be adjusted again.
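One sketch for giving the probe that headroom without a single huge fixed delay is to raise failureThreshold instead, so slow startups and recoveries are tolerated (values are illustrative):
livenessProbe:
  tcpSocket:
    port: transport
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 30   # tolerates up to ~5 more minutes (30 x 10s) before the kill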
Thank you @mat1010! I will close the issue now.
I'm so glad other people have had these issues before me.
@ivanovaleksandar it appears your Elasticsearch data node is running with GlusterFS persistent storage. Do you run into issues related to CorruptIndexException errors?