gluster/gluster-containers

gluster-blockd and gluster-block-setup not starting after reboot of node: wrong place for 'systemctl enable...' in update-params.sh ?


Hi,

we have a strange issue on OpenShift 3.11 with GlusterFS (latest version installed yesterday):

OpenShift installs GlusterFS, glusterfs-storage pods start successfully.

At some point during the installation procedure glusterfs nodes are rebooted. Afterwards the glusterfs-storage pods don't start.

oc logs glusterfs-xxx shows:

Enabling gluster-block service and updating env. variables
Failed to get D-Bus connection: Connection refused
Failed to get D-Bus connection: Connection refused

If I exec into the container and call

systemctl status glusterd

It's enabled and running.

If I do

systemctl status gluster-blockd
or 
systemctl status gluster-block-setup

Each service is reported as disabled and is not running.

If I call

systemctl enable gluster-blockd
systemctl start gluster-blockd

and then leave the pod, the glusterfs-storage pod is reported as running by the OpenShift oc command after a few seconds.

I can get the same result by deleting the glusterfs-storage pod. Because it is part of a DaemonSet in OpenShift, the pod restarts and is running immediately. The logs don't show any D-Bus errors in this case!

It looks like some weird race condition where the GlusterFS container behaves differently after a node reboot than after a pod restart.
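
For illustration only (the container name below is a placeholder), one way to check from the node whether systemd's bus is actually reachable inside the container at that moment:

# Placeholder container name; this prints the systemd state, or fails with the
# same "Failed to get D-Bus connection" error if the bus is not up yet.
docker exec -it <glusterfs-storage-container> systemctl is-system-running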

Best regards,

Josef

The services gluster-blockd and gluster-block-setup are both started correctly after a reboot if the commands

systemctl enable gluster-blockd
systemctl enable gluster-block-setup

are called in the Dockerfile and not in the script update-params.sh! :-)

That was already implemented in a former version of the Docker image but was changed a few days ago.
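
For illustration, a minimal sketch of what enabling the units at build time could look like (the exact RUN line is an assumption, not the actual Dockerfile content):

# Sketch of a Dockerfile build step; systemctl enable only creates the
# multi-user.target.wants symlinks, so it works without a running systemd.
RUN systemctl enable gluster-blockd.service && \
    systemctl enable gluster-block-setup.service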

Could you confirm that, please?


Yes, the systemctl enable was moved from the Dockerfile to update-params.sh recently. This was done to introduce a new variable, GLUSTER_BLOCK_ENABLED.

By any chance, have you passed GLUSTER_BLOCK_ENABLED=FALSE?
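
For context, a rough sketch of the kind of conditional being described (not the actual update-params.sh; the else-branch message is an assumption):

# Sketch only: enable the block services only when GLUSTER_BLOCK_ENABLED is TRUE.
if [ "${GLUSTER_BLOCK_ENABLED}" = "TRUE" ]; then
    echo "Enabling gluster-block service and updating env. variables"
    systemctl enable gluster-blockd.service
    systemctl enable gluster-block-setup.service
else
    echo "gluster-block is disabled"   # assumed message, not from the real script
fi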

@SaravanaStorageNetwork
Thanks for your answer. I don't touch the GLUSTER_BLOCK_ENABLED variable. It must be set to TRUE, because I see the echo message from the if branch of the if ... then ... else, and I also see two D-Bus errors, one for each of the systemctl commands.

It definitely seems to be the case that D-Bus is not available when the container starts after a reboot. If I restart the container manually by killing it, both systemctl commands run successfully without D-Bus errors.

I created a new version of the Docker image with both systemctl enable commands in the Dockerfile. With this image the glusterfs-storage containers also start successfully after a reboot.
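
Just as an illustration of another way to work around the race inside the script itself (purely a sketch, not what the image does), the enable could be retried until systemd's bus becomes reachable:

# Sketch only: retry for up to ~30 seconds instead of failing once with a D-Bus error.
for i in $(seq 1 30); do
    systemctl enable gluster-blockd.service gluster-block-setup.service && break
    sleep 1
done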


Thanks for the pointers. This looks a little weird; I will get back on this after some debugging.

@SaravanaStorageNetwork
To be more precise: I didn't restart a glusterfs-storage container but the glusterfs-storage pod.

I get the same issue: after a node/host reboot, the glusterfs container fails its health checks.

Events:
  Type     Reason          Age                From                Message
  ----     ------          ---                ----                -------
  Normal   SandboxChanged  39m                kubelet, k8snode01  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          39m                kubelet, k8snode01  Container image "gluster/gluster-centos:latest" already present on machine
  Normal   Created         39m                kubelet, k8snode01  Created container
  Normal   Started         38m                kubelet, k8snode01  Started container
  Warning  Unhealthy       38m                kubelet, k8snode01  Readiness probe failed: /usr/local/bin/status-probe.sh
                                                                  failed check: systemctl -q is-active glusterd.service
  Warning  Unhealthy       38m                kubelet, k8snode01  Liveness probe failed: /usr/local/bin/status-probe.sh
                                                                  failed check: systemctl -q is-active glusterd.service
  Warning  Unhealthy       34m (x9 over 37m)  kubelet, k8snode01  Liveness probe failed: /usr/local/bin/status-probe.sh
                                                                  failed check: systemctl -q is-active gluster-blockd.service
  Warning  Unhealthy       3m (x81 over 37m)  kubelet, k8snode01  Readiness probe failed: /usr/local/bin/status-probe.sh
                                                                  failed check: systemctl -q is-active gluster-blockd.service

and looking at the docker logs:

[root@k8snode01 ~]# docker logs 2124ac6f2e00
Enabling gluster-block service and updating env. variables
Failed to get D-Bus connection: Connection refused
Failed to get D-Bus connection: Connection refused

Exec into the container and glusterd is reported as running:

[root@k8snode01 ~]# docker exec -it ae80e12af2e0 systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2019-01-11 13:26:06 UTC; 13min ago
Process: 141 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 142 (glusterd)
CGroup: /kubepods/burstable/poda9ebe292-14e8-11e9-be34-005056beca32/ae80e12af2e033db097d8e9906128a7e371d3364c0e2bdb40809e03268f5df64/system.slice/glusterd.service
├─142 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO...
├─180 /usr/sbin/glusterfs -s localhost --volfile-id gluster/gluste...
├─189 /usr/sbin/glusterfsd -s 10.34.88.166 --volfile-id heketidbst...
├─198 /usr/sbin/glusterfsd -s 10.34.88.166 --volfile-id vol_512ee7...
├─206 /usr/sbin/glusterfsd -s 10.34.88.166 --volfile-id vol_55f566...
├─216 /usr/sbin/glusterfsd -s 10.34.88.166 --volfile-id vol_805eb7...
└─225 /usr/sbin/glusterfsd -s 10.34.88.166 --volfile-id vol_f15984...

Jan 11 13:26:03 k8snode01 systemd[1]: Starting GlusterFS, a clustered file-.....
Jan 11 13:26:06 k8snode01 systemd[1]: Started GlusterFS, a clustered file-s...r.
Hint: Some lines were ellipsized, use -l to show in full.
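
The check that the probes report as failing can also be reproduced directly for gluster-blockd; a non-zero exit status here is what makes the probe fail:

[root@k8snode01 ~]# docker exec -it ae80e12af2e0 systemctl -q is-active gluster-blockd.service; echo "exit: $?"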

Doing this fixes it (from the node):

[root@k8snode01 ~]# docker exec -it ae80e12af2e0 systemctl enable gluster-blockd
Created symlink from /etc/systemd/system/multi-user.target.wants/gluster-blockd.service to /usr/lib/systemd/system/gluster-blockd.service.
[root@k8snode01 ~]# docker exec -it ae80e12af2e0 systemctl start gluster-blockd
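
If several nodes were rebooted, a sketch of applying the same workaround to every glusterfs pod via kubectl (the namespace and label selector are assumptions; adjust them to whatever the daemonset actually uses):

# Sketch only: namespace "glusterfs" and label "glusterfs=pod" are assumed.
for pod in $(kubectl -n glusterfs get pods -l glusterfs=pod -o jsonpath='{.items[*].metadata.name}'); do
    kubectl -n glusterfs exec "$pod" -- systemctl enable gluster-blockd
    kubectl -n glusterfs exec "$pod" -- systemctl start gluster-blockd
done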

Ack

@jomeier @kesterriley Thanks! I have reverted the PR that introduced the change (moving the systemctl commands from the Dockerfile to the shell script).