Search engine: docker deployment issues
Possibly affects the IDR monitoring stack as well
Initially reported by @dominikl in the context of a pilot VM. The playbook deployment/ansible/idr-docker.yml (lines 7 to 8 in 0ec6d8d) fails at the restart docker handler:
RUNNING HANDLER [ome.docker : restart docker] *****************************************************************************************************************************************************************
fatal: [test120-searchengine]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "msg": "Unable to restart service docker: Job for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"}
Looking at the logs
Feb 02 13:42:21 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:21.663010221Z" level=info msg="Starting up"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.450727005Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.451427414Z" level=info msg="Loading containers: start."
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.529980377Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.530549152Z" level=error msg="Failed to set bridge MTU docker0 via netlink" error="invalid argument"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.532190944Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: failed to start daemon: Error initializing network controller: error creating default "bridge" network: invalid argument
Feb 02 13:42:22 test120-searchengine.novalocal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Feb 02 13:42:22 test120-searchengine.novalocal systemd[1]: Failed to start Docker Application Container Engine.
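To pull the relevant lines out of a longer journal, something like the following could help. It is only a sketch: the function name `dockerd_errors` is made up here, and it assumes the logfmt-style `key="value"` layout seen in the lines above.

```python
import re

# Matches logfmt-style fields as written by dockerd, e.g. level=error msg="..."
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
FATAL = "failed to start daemon:"


def dockerd_errors(journal_text):
    """Return (message, detail) pairs for error-level and fatal dockerd lines."""
    findings = []
    for line in journal_text.splitlines():
        if FATAL in line:
            # The fatal line is plain text, not logfmt, so handle it separately.
            findings.append(("failed to start daemon", line.split(FATAL, 1)[1].strip()))
            continue
        fields = {key: value.strip('"') for key, value in FIELD.findall(line)}
        if fields.get("level") == "error":
            findings.append((fields.get("msg", ""), fields.get("error", "")))
    return findings


# Three lines copied from the journal output in this issue.
sample = '''Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.451427414Z" level=info msg="Loading containers: start."
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.530549152Z" level=error msg="Failed to set bridge MTU docker0 via netlink" error="invalid argument"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: failed to start daemon: Error initializing network controller: error creating default "bridge" network: invalid argument'''

for message, detail in dockerd_errors(sample):
    print(f"{message}: {detail}")
```

On this sample it keeps only the bridge MTU error and the fatal "failed to start daemon" line, which are the two entries that matter here.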
Removing /etc/docker/daemon.json, or simply dropping the mtu setting (docker_use_ipv4_nic_mtu: false), is enough to let the Docker service restart. However, docker ps then fails with
[sbesson@test120-searchengine ~]$ sudo docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
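Before removing the file, it may be worth checking whether an MTU is actually set in it. A minimal sketch, assuming the role writes the value under the top-level `mtu` key of daemon.json; the helper name `configured_mtu` and the example value are hypothetical, since the real file content on test120 is not shown in this thread.

```python
import json


def configured_mtu(daemon_json_text):
    """Return the "mtu" value from daemon.json content, or None if it is not set."""
    # An empty file is treated the same as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    return config.get("mtu")


# Hypothetical content, for illustration only.
example = '{"mtu": 1450}'
print(configured_mtu(example))
```

On the VM, the text would come from reading /etc/docker/daemon.json, e.g. `configured_mtu(open("/etc/docker/daemon.json").read())`.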
The version of Docker is
[sbesson@test120-searchengine ~]$ docker -v
Docker version 25.0.2, build 29cf629
while on a recent successful environment, it is
[sbesson@prod120-searchengine ~]$ docker -v
Docker version 24.0.7, build afdd53b
Forcing the Docker version to 24.0.7
diff --git a/ansible/idr-docker.yml b/ansible/idr-docker.yml
index 2a53643..e87fc6a 100644
--- a/ansible/idr-docker.yml
+++ b/ansible/idr-docker.yml
@@ -6,7 +6,7 @@
   roles:
     - role: ome.docker
       docker_use_ipv4_nic_mtu: True
-
+      docker_version: 24.0.7
   tasks:
     - name: install docker-python
       become: yes
seems to be sufficient to make progress with the playbook. So I suspect some upstream changes are incompatible with the way we deploy Docker using ome.docker.
moby/moby#47308 looks related and is expected to be resolved with Docker 25.0.3 (or the migration to Rocky Linux 9)
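To flag hosts that might be affected, the `docker -v` output can be compared against the versions mentioned in this thread. This is a rough sketch: the function names are made up, and the suspect range is an assumption based only on what is reported here (24.0.7 works, 25.0.2 fails, 25.0.3 is expected to carry the fix). The lower bound of 25.0.0 in particular has not been verified.

```python
import re


def parse_docker_version(output):
    """Extract (major, minor, patch) from `docker -v` output."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognised version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def in_suspect_range(version):
    """True for 25.0.0 <= version < 25.0.3 (assumed range, see the note above)."""
    return (25, 0, 0) <= version < (25, 0, 3)


for output in ("Docker version 25.0.2, build 29cf629",
               "Docker version 24.0.7, build afdd53b"):
    version = parse_docker_version(output)
    print(version, "suspect" if in_suspect_range(version) else "ok")
```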
When testing devspace on the test RHEL 9 VM, I had to edit the dockerd systemd unit file.
What is currently in it is
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
and it is expecting
ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock --containerd=/run/containerd/containerd.sock
Note that I did not have the issue on the physical RHEL 9 machine.
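For context, `-H fd://` tells dockerd to take its listening socket from systemd socket activation, while `-H unix:///var/run/docker.sock` makes it open that socket path itself. A small sketch to compare what a given unit passes to `-H`; the helper name `dockerd_hosts` is made up, and it assumes the ExecStart= value fits on a single line.

```python
import shlex


def dockerd_hosts(exec_start_line):
    """Return the values passed to -H/--host in a systemd ExecStart= line."""
    command = exec_start_line.split("=", 1)[1]
    args = shlex.split(command)
    hosts = []
    for index, arg in enumerate(args):
        if arg in ("-H", "--host") and index + 1 < len(args):
            hosts.append(args[index + 1])
        elif arg.startswith("--host="):
            hosts.append(arg.split("=", 1)[1])
    return hosts


# The two ExecStart lines quoted above.
current = "ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock"
expected = "ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock --containerd=/run/containerd/containerd.sock"

print(dockerd_hosts(current))
print(dockerd_hosts(expected))
```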
Downgrading to a 24.x version might also solve the problem I have when running devspace (omero-server takes a long time to start). I am currently running
docker --version
Docker version 25.0.2, build 29cf629
I was able to spin up test120 on Friday by downgrading Docker to the last 24.x version. Pushed 825c70b accordingly so that we unblock the creation of production & pilot environments. Once Docker 25.0.3 is released or we migrate to Rocky Linux 9, we can evaluate dropping the version pinning.