ibuildthecloud/systemd-docker

Doesn't work with recent systemd and/or docker releases

Opened this issue · 21 comments

Something went wrong after an update and I haven't dug into it too much since this project seems dead. Basically, systemd sees the systemd-docker executable as dying with a status code of 1, but the container is still running. I don't know if there's some mixup with Docker communication going on or if systemd changed somewhere to break it. I suspect Docker, but I have no evidence.

The best solution seems to be migrating to rkt for running the images, but that is blocked by rkt/rkt#2392 without the ability to push images to a registry.

I tried this and some other tricks, and each breaks in an interesting way. The right solution is rkt, but unfortunately it is not very popular, and in my case I needed Docker volumes because of NVIDIA.
In the end I settled on running two services:

  1. The first one is a docker run ... service:
[Unit]
Description=Main service to run
After=docker.service
Wants=docker.service

[Service]
StandardOutput=journal
StandardError=journal
Restart=on-failure
RestartSec=10
ExecStartPre=-/usr/bin/docker stop Main
ExecStartPre=-/usr/bin/docker rm Main
ExecStart=/usr/bin/docker run -h Main -i -a stdout -a stderr --rm --name Main theimagenamehere bash
ExecStop=-/usr/bin/docker stop Main
ExecStopPost=-/usr/bin/docker rm Main

[Install]
WantedBy=multi-user.target

  2. Another service that depends on the main service and actually runs my service:
[Unit]
Description=MyService
After=Main.service
Wants=Main.service

[Service]
StandardOutput=journal
StandardError=journal
Restart=on-failure
RestartSec=10
ExecStart=/opt/bin/docker_exec.sh Main /path/to/service --service-params

[Install]
WantedBy=multi-user.target

...

Here is docker_exec.sh:

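#!/bin/bash
# Usage: docker_exec.sh CONTAINER COMMAND [ARGS...]
# Runs COMMAND inside an already-running container and forwards TERM/INT to it.

# Kill the process group recorded in the PID file inside the container.
function docker_cleanup {
    docker exec "$IMAGE" bash -c "if [ -f $PIDFILE ]; then kill -TERM -\$(cat $PIDFILE); rm $PIDFILE; fi"
}

IMAGE=$1                       # despite the name, this is the container name
PIDFILE=/tmp/docker-exec-$$    # unique per invocation of this script
shift
trap 'kill "$PID"; docker_cleanup' TERM INT
# Record the in-container PID, then replace the shell with the service.
docker exec "$IMAGE" bash -c "echo \"\$\$\" > $PIDFILE; exec $*" &
PID=$!
wait "$PID"
trap - TERM INT
wait "$PID"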

This way I can also exec multiple services in the same container.

Which Docker and systemd versions make it break?

Fedora 24 had Docker 1.10.3, which worked. I'm now on 1.13.1, which doesn't. systemd went from 229 to 233.

Broken on docker 17.12 and Debian stretch, fails with this error:

json: cannot unmarshal object into Go value of type string

Updating the vendored docker client fixes it.

Same failure on Ubuntu 16.04.3
Docker is Server Version: 17.12.0-ce, specifically 17.12.0ce-0ub from the PPA
Systemd is 229-4ubuntu21

What did you mean by "Updating the vendored docker client"?

@Halfwalker I'm guessing it's these projects. Does Go not have a better solution than "embed source code" for dependency management?

@Halfwalker here is what worked for me: I built the executable in a fresh golang container and then checked the binary on Ubuntu 16.04.3 LTS.

go get github.com/agend07/systemd-docker
cd /go/src/github.com/agend07/systemd-docker
./build
The systemd-docker binary then ends up in the bin folder.

I'm wondering how much it's really needed ... Plain old docker run --rm .... in the systemd unit file seems to be working fine. I can systemctl start|stop my_container and it all seems to work OK.

@Halfwalker What Type= is your Service? Systemd isn't doing any lifecycle management without this (e.g., if the container dies, it can't enforce the Restart= or RestartSec= actions)

I used Plex as a test container - figured that would be a good stressor ...

[Unit]
Description=Plex Media Server
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=120
ExecStartPre=/usr/bin/docker pull plexinc/pms-docker
ExecStart=/usr/bin/docker run --rm --name=plex \
--network=host \
-e TZ=America/New_York \
-e PLEX_UID=1000 \
-e PLEX_GID=1000 \
-v /stuff/Plex/config:/config \
-v /stuff/Plex/transcode:/transcode \
plexinc/pms-docker

ExecStop=/usr/bin/docker stop plex
ExecStopPost=/usr/bin/docker rm -f plex
ExecReload=/usr/bin/docker restart plex

Restart=always
RestartSec=20s
Type=notify
NotifyAccess=all

[Install]
WantedBy=multi-user.target

All the testing was done in a virtualbox VM. Regular systemctl start/stop plex worked fine. Rebooting the box worked fine.

I don't think it would work fine if the plex container crashed. As long as the docker service keeps working, systemd wouldn't know anything wrong happened, because it would be monitoring the docker service, not your plex container. The issue is that when you start a docker container you talk to docker, and docker starts another process (with plex), which confuses systemd.

So check whether systemd restarts your plex container after you kill it with 'kill pid' or 'docker kill'.
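The check suggested above could be scripted roughly as follows. This is only a sketch: `container_restarted_after_kill` is a hypothetical helper name, and it assumes the container is started by name (as with `--name=plex`), so that a restarted container shows up under the same name with a new ID.

```shell
# Sketch: return 0 if the container named $1 comes back with a new ID
# within $2 seconds (default 60) after being killed, 1 if it does not,
# and 2 if the container could not be inspected or killed at all.
container_restarted_after_kill() {
    name="$1"
    timeout="${2:-60}"
    old_id=$(docker inspect --format '{{.Id}}' "$name") || return 2
    docker kill "$name" >/dev/null || return 2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # While the container is gone, inspect fails and new_id stays empty.
        new_id=$(docker inspect --format '{{.Id}}' "$name" 2>/dev/null) || new_id=""
        if [ -n "$new_id" ] && [ "$new_id" != "$old_id" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

For example, `container_restarted_after_kill plex 90 && echo "restarted"` would report whether something (systemd or docker) brought the container back.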

Right - then systemd is monitoring docker, not the plex process in docker. With the unit file above though, docker would do the restart of the plex container. So while systemd wouldn't know that the plex container took a hit and restarted, the end result is the same : the plex container was restarted. systemd would only step in if docker itself died.
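One way to see which process systemd treats as the unit's main process is to compare its MainPID with the PID of the container's init process. The sketch below assumes a unit and a container name are passed in; `main_pid_is_container` is a hypothetical helper name, not part of systemd-docker.

```shell
# Sketch: compare systemd's MainPID for unit $1 with the host PID of the
# init process of container $2. Returns 0 when they match, 1 when they
# differ, 2 when either lookup fails.
main_pid_is_container() {
    unit="$1"
    container="$2"
    main_pid=$(systemctl show -p MainPID "$unit") || return 2
    main_pid="${main_pid#MainPID=}"   # strip the "MainPID=" prefix
    container_pid=$(docker inspect --format '{{.State.Pid}}' "$container") || return 2
    echo "systemd MainPID=$main_pid container PID=$container_pid"
    [ "$main_pid" = "$container_pid" ]
}
```

With a plain `docker run` in ExecStart, the two PIDs would be expected to differ, since systemd tracks the docker client process it started.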

My preference would be for systemd to know about plex though, via systemd-docker. That just seems much cleaner.

A new build of systemd-docker seems to work. I've been pulling systemd-docker in via ansible for installs, but had to switch to plain docker run when it started failing. Now looking at what's needed to actually build a "latest" systemd-docker on a target system via ansible. @agend07 are you going to do a new release to handle the latest docker ?

Here's a simple way to build a new version of systemd-docker if you don't want to install golang etc. Requires docker though :)

docker run --rm -it -v "/usr/local/bin":/output golang:1.9 /bin/bash -c "go get github.com/agend07/systemd-docker && cd /go/src/github.com/agend07/systemd-docker && ./build && cp bin/systemd-docker /output"

I just updated from 16.04.x to 18.04.1 and now my systemd-docker is broken (error: json: cannot unmarshal object into Go value of type string).

Unfortunately I can't test by building my own version, because "go get github.com/agend07/systemd-docker" gives me an error: fatal: repository 'https://github.com/weaveworks/docker/' not found

What can I do to get it working again?

@firex2 try to build it again. I forked the repo with mflag into my GitHub account. It builds, but I'm not sure it will work on 18.04.1.

The build was fine this time, but I'm getting errors on startup:

Aug 27 09:20:53 hs systemd[1]: Starting nginx webserver container...
-- Subject: Unit docker-webserver.service has begun start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit docker-webserver.service has begun starting up.
Aug 27 09:20:55 hs docker[16408]: latest: Pulling from library/nginx
Aug 27 09:20:55 hs docker[16408]: Digest: sha256:d85914d547a6c92faa39ce7058bd7529baacab7e0cd4255442b04577c4d1f424
Aug 27 09:20:55 hs docker[16408]: Status: Image is up to date for nginx:latest
Aug 27 09:20:55 hs systemd-docker[16425]: 2018/08/27 09:20:55 open /sys/fs/cgroup/system.slice/docker.service/cgroup.procs: no such file or directory
Aug 27 09:20:55 hs systemd[1]: docker-webserver.service: Main process exited, code=exited, status=1/FAILURE
Aug 27 09:20:55 hs systemd[1]: docker-webserver.service: Failed with result 'exit-code'.
Aug 27 09:20:55 hs systemd[1]: Failed to trim compat systemd cgroup /system.slice/docker-webserver.service: Device or resource busy
Aug 27 09:20:55 hs systemd[1]: Failed to start nginx webserver container.
-- Subject: Unit docker-webserver.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit docker-webserver.service has failed.
--
-- The result is RESULT.
Aug 27 09:20:56 hs kernel: CIFS VFS: Error connecting to socket. Aborting operation.
Aug 27 09:20:56 hs kernel: CIFS VFS: Error connecting to socket. Aborting operation.
Aug 27 09:20:56 hs kernel: CIFS VFS: cifs_mount failed w/return code = -113
Aug 27 09:20:56 hs kernel: CIFS VFS: cifs_mount failed w/return code = -113
Aug 27 09:20:56 hs kernel: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3 (or SMB2.1) specify vers=1.0 on mount.
Aug 27 09:20:56 hs kernel: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3 (or SMB2.1) specify vers=1.0 on mount.
Aug 27 09:21:02 hs kernel: CIFS VFS: Error connecting to socket. Aborting operation.
Aug 27 09:21:02 hs kernel: CIFS VFS: Error connecting to socket. Aborting operation.
Aug 27 09:21:02 hs kernel: CIFS VFS: cifs_mount failed w/return code = -113
Aug 27 09:21:02 hs kernel: CIFS VFS: cifs_mount failed w/return code = -113
Aug 27 09:21:02 hs kernel: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3 (or SMB2.1) specify vers=1.0 on mount.
Aug 27 09:21:02 hs kernel: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3 (or SMB2.1) specify vers=1.0 on mount.

@firex2 try googling '/sys/fs/cgroup/system.slice/docker.service/cgroup.procs: no such file or directory'.

What I found: moby/moby#17653 (there is a possible workaround in moby/moby#17653 (comment))

moby/moby#27633

Does it happen when you run the service with plain docker (instead of systemd-docker)?

If I run it from the plain command line, everything is fine with just "docker run ...".

If I only change from systemd-docker to docker in my systemd service, it cannot start because of a timeout.

It does not help to add "{"exec-opts": ["native.cgroupdriver=systemd"]}" to the docker config file.

-- Subject: Unit libcontainer_29752_systemd_test_default.slice has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit libcontainer_29752_systemd_test_default.slice has finished starting up.
--
-- The start-up result is RESULT.
Aug 27 10:10:35 hs systemd[1]: Removed slice libcontainer_29752_systemd_test_default.slice.
-- Subject: Unit libcontainer_29752_systemd_test_default.slice has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit libcontainer_29752_systemd_test_default.slice has finished shutting down.
Aug 27 10:10:35 hs dockerd[24254]: time="2018-08-27T10:10:35+02:00" level=info msg="shim reaped" id=a78e7de3c0fc49f9cf3d10c4f12577dd69fb07fa4f85e7a7ad7e5658a3e36e01
Aug 27 10:10:35 hs dockerd[24254]: time="2018-08-27T10:10:35.717917049+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 27 10:10:35 hs systemd-networkd[878]: vethb885f58: Lost carrier
Aug 27 10:10:35 hs systemd-timesyncd[834]: Network configuration changed, trying to establish connection.
Aug 27 10:10:35 hs systemd-timesyncd[834]: Synchronized to time server 10.0.0.1:123 (10.0.0.1).
Aug 27 10:10:35 hs systemd[1]: Starting resolvconf-pull-resolved.service...
-- Subject: Unit resolvconf-pull-resolved.service has begun start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit resolvconf-pull-resolved.service has begun starting up.
Aug 27 10:10:35 hs kernel: docker0: port 1(vethb885f58) entered disabled state
Aug 27 10:10:35 hs kernel: veth6b24488: renamed from eth0
Aug 27 10:10:35 hs systemd[1]: Started resolvconf-pull-resolved.service.
-- Subject: Unit resolvconf-pull-resolved.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit resolvconf-pull-resolved.service has finished starting up.
--
-- The start-up result is RESULT.
Aug 27 10:10:35 hs kernel: docker0: port 1(vethb885f58) entered disabled state
Aug 27 10:10:35 hs systemd-timesyncd[834]: Network configuration changed, trying to establish connection.
Aug 27 10:10:35 hs systemd-timesyncd[834]: Synchronized to time server 10.0.0.1:123 (10.0.0.1).
Aug 27 10:10:35 hs networkd-dispatcher[1177]: WARNING:Unknown index 95 seen, reloading interface list
Aug 27 10:10:35 hs kernel: device vethb885f58 left promiscuous mode
Aug 27 10:10:35 hs kernel: docker0: port 1(vethb885f58) entered disabled state
Aug 27 10:10:35 hs systemd[1]: Starting resolvconf-pull-resolved.service...
-- Subject: Unit resolvconf-pull-resolved.service has begun start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit resolvconf-pull-resolved.service has begun starting up.
Aug 27 10:10:35 hs systemd[1]: Started resolvconf-pull-resolved.service.
-- Subject: Unit resolvconf-pull-resolved.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit resolvconf-pull-resolved.service has finished starting up.
--
-- The start-up result is RESULT.
Aug 27 10:10:35 hs systemd-udevd[29787]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 27 10:10:35 hs networkd-dispatcher[1177]: ERROR:Unknown interface index 95 seen even after reload
Aug 27 10:10:35 hs systemd-udevd[29787]: link_config: could not get ethtool features for veth6b24488
Aug 27 10:10:35 hs systemd-udevd[29787]: Could not set offload features of veth6b24488: No such device
Aug 27 10:10:35 hs systemd[1]: docker-webserver.service: Failed with result 'timeout'.
Aug 27 10:10:35 hs systemd[1]: Failed to start nginx webserver container.
-- Subject: Unit docker-webserver.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit docker-webserver.service has failed.
--
-- The result is RESULT.
Aug 27 10:10:36 hs systemd-networkd[878]: docker0: Lost carrier
Aug 27 10:10:36 hs systemd-timesyncd[834]: Network configuration changed, trying to establish connection.
Aug 27 10:10:36 hs systemd-timesyncd[834]: Synchronized to time server 10.0.0.1:123 (10.0.0.1).
Aug 27 10:10:36 hs systemd[1]: Starting resolvconf-pull-resolved.service...
-- Subject: Unit resolvconf-pull-resolved.service has begun start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit resolvconf-pull-resolved.service has begun starting up.
Aug 27 10:10:36 hs systemd[1]: Started resolvconf-pull-resolved.service.
-- Subject: Unit resolvconf-pull-resolved.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit resolvconf-pull-resolved.service has finished starting up.
--
-- The start-up result is RESULT.
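Regarding the exec-opts setting mentioned before the log: the daemon only reads it at startup, so it needs a restart before the change applies. A small sketch for confirming which cgroup driver the running daemon reports (`cgroup_driver_is` is a hypothetical helper name):

```shell
# Sketch: return 0 if the running Docker daemon reports cgroup driver $1
# (for example "systemd" or "cgroupfs"), 1 if it reports another driver,
# 2 if the daemon cannot be queried.
cgroup_driver_is() {
    wanted="$1"
    actual=$(docker info --format '{{.CgroupDriver}}') || return 2
    [ "$actual" = "$wanted" ]
}
```

For example, `cgroup_driver_is systemd || echo "daemon is not using the systemd driver"`.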

I was able to get systemd-docker working with Ubuntu 18.04 by:

  1. Building systemd-docker from scratch from @agend07's fork, as @agend07 and @Halfwalker discussed above.
  2. Adding --cgroups name=systemd just after systemd-docker in the unit file, per https://container-solutions.com/running-docker-containers-with-systemd/

My guess is that Docker defaults to not using systemd for cgroups because "the delegate issues still exists and systemd currently does not support the cgroup feature set required for containers run by docker" (per the docker.service unit file). I expect systemd-docker assumes systemd manages the cgroups, hence the open /sys/fs/cgroup/system.slice/docker.service/cgroup.procs: no such file or directory error. Setting --cgroups name=systemd apparently overrides the Docker default; however, I cannot say what side effects this may have, given the ominous note in the docker.service unit file.
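The path in that error can be explored with a small sketch. It assumes (this thread does not confirm it) that each line of /proc/<pid>/cgroup is mapped to /sys/fs/cgroup/<controller>/<path>/cgroup.procs, with a leading "name=" stripped from the controller; `cgroup_procs_paths` is a hypothetical helper name. Under that assumption, an entry with an empty controller field, such as 0::/system.slice/docker.service, yields exactly the path from the error, while a name=systemd entry maps into the systemd hierarchy.

```shell
# Sketch: read "ID:controllers:path" lines (the /proc/<pid>/cgroup format)
# on stdin and print the cgroup.procs path each one would map to under the
# root given as $1 (default /sys/fs/cgroup). The mapping rule is an
# assumption for illustration, not taken from the systemd-docker source.
cgroup_procs_paths() {
    root="${1:-/sys/fs/cgroup}"
    while IFS=: read -r id controllers cgpath; do
        controllers="${controllers#name=}"   # name=systemd -> systemd
        if [ -n "$controllers" ]; then
            printf '%s\n' "$root/$controllers$cgpath/cgroup.procs"
        else
            # Empty controller field: no per-controller directory in the path.
            printf '%s\n' "$root$cgpath/cgroup.procs"
        fi
    done
}
```

Feeding it the daemon's own listing, for example `cgroup_procs_paths < /proc/$(pidof dockerd)/cgroup`, and checking which printed paths exist may help narrow down which hierarchy is the problematic one.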

You can get a binary for AMD64 Linux (tested on CoreOS and Ubuntu 18.04) at https://github.com/subdavis/systemd-docker/releases/tag/1.0.0

This is just a build of @agend07's fork.

To make it work with Docker 19.03.6 on Debian 9.12, I had to build @agend07's fork with Go 1.13.

Go 1.9 would not work.

If anybody needs it, here is the compressed resulting binary

systemd-docker.zip
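For anyone who prefers to rebuild rather than download a binary, the Docker-based one-liner from earlier in the thread can be parameterised by Go version. This is a sketch under assumptions: `build_systemd_docker` is a hypothetical helper name, the golang:1.13 image tag is chosen to match the Go 1.13 report above, and -it is dropped so the command can also run without a TTY.

```shell
# Sketch: build @agend07's fork inside a golang container and copy the
# binary to an output directory on the host.
#   $1: Go image tag (default 1.13, per the report above)
#   $2: host directory that receives the binary (default /usr/local/bin)
build_systemd_docker() {
    go_version="${1:-1.13}"
    output_dir="${2:-/usr/local/bin}"
    docker run --rm -v "$output_dir":/output "golang:$go_version" /bin/bash -c \
        "go get github.com/agend07/systemd-docker && cd /go/src/github.com/agend07/systemd-docker && ./build && cp bin/systemd-docker /output"
}
```

For example, `build_systemd_docker 1.13 "$PWD/out"` would leave the binary in ./out instead of /usr/local/bin.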