Docker swarm incompatibility
Zagitta opened this issue · 6 comments
It might be worth noting in the documentation that this won't work on docker swarm, because the worker requires privileged mode.
The database and web containers work just fine; the worker node, however, fails with some very cryptic error messages like:
{"timestamp":"2019-09-30T14:31:24.520408669Z","level":"error","source":"guardian","message":"guardian.starting-guardian-backend","data":{"error":"bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': operation not permitted"}}
and
{"timestamp":"2019-09-30T14:31:24.528488853Z","level":"error","source":"worker","message":"worker.garden-runner.logging-runner-exited","data":{"error":"Exit trace for group:\ngdn exited with error: exit status 1\ndns-proxy exited with nil\n","session":"8"}}
which scroll out of view rather quickly because the following error gets logged repeatedly:
{"timestamp":"2019-09-30T14:31:28.144058311Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.5"}}
The web node still registers the worker node, which leads to further confusion.
Hopefully this saves someone else a couple of painful hours.
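Since the root cause gets buried under the repeated failed-to-dial message, a small log filter can help. The following is only a sketch: it assumes the worker logs are the one-JSON-object-per-line format shown above, and the names root_errors and NOISY_MESSAGES are made up for illustration, not part of Concourse.

```python
import json

# Message that is repeated every few seconds once the runtime backend has died.
# Treating it as noise is an assumption based on the logs quoted above.
NOISY_MESSAGES = {"worker.beacon-runner.beacon.forward-conn.failed-to-dial"}


def root_errors(lines, noisy=NOISY_MESSAGES):
    """Yield (timestamp, message, error) for error-level entries not in `noisy`."""
    for line in lines:
        # docker-compose prefixes lines with e.g. "worker_1 | "; keep the JSON part.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        if entry.get("level") != "error" or entry.get("message") in noisy:
            continue
        data = entry.get("data")
        error = data.get("error", "") if isinstance(data, dict) else ""
        yield entry.get("timestamp", ""), entry.get("message", ""), error
```

Feeding it the saved worker output (for example an open file object or sys.stdin) and printing each tuple should leave only entries such as guardian.starting-guardian-backend with its "operation not permitted" error.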
The same issue:
worker_1 | {"timestamp":"2021-01-18T13:19:06.540640000Z","level":"error","source":"baggageclaim","message":"baggageclaim.fs.run-command.failed","data":{"args":["bash","-e","-x","-c","\n\t\tif [ ! -e $IMAGE_PATH ] || [ \"$(stat --printf=\"%s\" $IMAGE_PATH)\" != \"$SIZE_IN_BYTES\" ]; then\n\t\t\ttouch $IMAGE_PATH\n\t\t\ttruncate -s ${SIZE_IN_BYTES} $IMAGE_PATH\n\t\tfi\n\n\t\tlo=\"$(losetup -j $IMAGE_PATH | cut -d':' -f1)\"\n\t\tif [ -z \"$lo\" ]; then\n\t\t\tlo=\"$(losetup -f --show $IMAGE_PATH)\"\n\t\tfi\n\n\t\tif ! file $IMAGE_PATH | grep BTRFS; then\n\t\t\tmkfs.btrfs --nodiscard $IMAGE_PATH\n\t\tfi\n\n\t\tmkdir -p $MOUNT_PATH\n\n\t\tif ! mountpoint -q $MOUNT_PATH; then\n\t\t\tmount -t btrfs -o discard $lo $MOUNT_PATH\n\t\tfi\n\t"],"command":"/bin/bash","env":["PATH=/usr/local/concourse/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","MOUNT_PATH=/worker-state/volumes","IMAGE_PATH=/worker-state/volumes.img","SIZE_IN_BYTES=258752974848"],"error":"exit status 1","session":"3.1","stderr":"+ '[' '!' -e /worker-state/volumes.img ']'\n+ touch /worker-state/volumes.img\n+ truncate -s 258752974848 /worker-state/volumes.img\n++ losetup -j /worker-state/volumes.img\n++ cut -d: -f1\n+ lo=\n+ '[' -z '' ']'\n++ losetup -f --show /worker-state/volumes.img\nlosetup: cannot find an unused loop device\n+ lo=\n","stdout":""}}
worker_1 | {"timestamp":"2021-01-18T13:19:06.540741000Z","level":"error","source":"baggageclaim","message":"baggageclaim.failed-to-set-up-driver","data":{"error":"failed to create btrfs filesystem: exit status 1"}}
worker_1 | error: failed to create btrfs filesystem: exit status 1
concourse-docker_worker_1 exited with code 1
I'm running into this issue on #70 but I'm not using docker swarm, just docker.
moby/moby#24862
Looks like this won't be solved anytime soon.
I've managed to get a little further by replacing privileged: true with cap_add: [NET_ADMIN] and setting CONCOURSE_RUNTIME to containerd.
I'm now stuck on the following error:
{"timestamp":"2022-01-28T19:11:04.686241153Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"operation not permitted","handle":"cbd0b4dd-84f8-4a9d-4b01-8ac8c27a968e","privileged":true,"session":"4.1.10","strategy":{"type":"import","path":"/usr/local/concourse/resource-types/docker-image/rootfs.tgz","follow_symlinks":false}}}
In the web UI this shows up as: run check: find or create container on worker 3272415d73a2: failed to create volume
It might be because I'm trying to create a privileged docker-image container.
I've just brought up the worker service in docker swarm successfully with sysbox-runc. However, it requires setting the default runtime on every node that will run worker containers, because docker stack does not support the runtime prop in docker-compose.yml:
# cat /etc/docker/daemon.json
{
  "runtimes": {
    "sysbox-runc": {
      "path": "/usr/bin/sysbox-runc"
    }
  },
  "default-runtime": "sysbox-runc"
}
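Because the runtime has to be set on each node separately, it may be worth checking the daemon config before deploying the stack. This is a rough sketch only; default_runtime_is is a hypothetical helper, and it assumes the daemon.json layout shown above.

```python
import json


def default_runtime_is(daemon_json_text, runtime):
    """Return True if `runtime` is both registered and set as the default runtime."""
    config = json.loads(daemon_json_text)
    registered = config.get("runtimes", {})
    return runtime in registered and config.get("default-runtime") == runtime
```

On a node, one could pass it the contents of /etc/docker/daemon.json together with "sysbox-runc".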
When I try running the hello-world example pipeline from the docs, I get a similar error:
run check: find or create container on worker dc72cdcf8d3d: failed to create volume
However, the reason shown in the logs is strange:
concourse_worker.0.j6y38t6ei8o7@swarm-2 | {"timestamp":"2022-06-24T15:24:51.529209021Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"invalid argument","handle":"7c77c360-c3ed-46aa-62bc-dae9695f43b6","privileged":false,"session":"4.1.87","strategy":{"type":"cow","volume":"df398c77-1fcc-42d2-5987-77b450071893"}}}
It's not a permissions problem this time, but "invalid argument".
Here's my docker-compose.yml:
version: '3.9'
services:
  web:
    image: concourse/concourse
    command: web
    ports:
      - published: 8084
        target: 8080
        mode: host
    networks:
      - concourse
    deploy:
      mode: global
      placement:
        constraints:
          - "node.role == manager"
    secrets:
      - authorized_worker_keys
      - session_signing_key
      - tsa_host_key
      - tsa_host_key.pub
    environment:
      CONCOURSE_EXTERNAL_URL: https://concourse.xxxxxxxxxxxx.com
      CONCOURSE_POSTGRES_HOST: xxxxxxxxxxxx
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: xxxxxxxxxxxx
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_ADD_LOCAL_USER: balthild:xxxxxxxxxxxx
      CONCOURSE_MAIN_TEAM_LOCAL_USER: balthild
      CONCOURSE_SESSION_SIGNING_KEY: /run/secrets/session_signing_key
      CONCOURSE_TSA_AUTHORIZED_KEYS: /run/secrets/authorized_worker_keys
      CONCOURSE_TSA_HOST_KEY: /run/secrets/tsa_host_key
      CONCOURSE_TSA_PUBLIC_KEY: /run/secrets/tsa_host_key.pub
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"
  worker:
    image: concourse/concourse
    command: worker
    networks:
      - concourse
    #privileged: true
    #runtime: sysbox-runc
    depends_on: [web]
    stop_signal: SIGUSR2
    deploy:
      mode: global
      placement:
        constraints:
          - "node.role != manager"
    secrets:
      - tsa_host_key.pub
      - worker_key
      - worker_key.pub
    environment:
      CONCOURSE_TSA_PUBLIC_KEY: /run/secrets/tsa_host_key.pub
      CONCOURSE_TSA_WORKER_PRIVATE_KEY: /run/secrets/worker_key
      CONCOURSE_TSA_HOST: web:2222
      CONCOURSE_RUNTIME: containerd
      CONCOURSE_BIND_IP: 0.0.0.0
      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
      # avoid using loopbacks
      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
      # work with docker-compose's dns
      CONCOURSE_CONTAINERD_DNS_PROXY_ENABLE: "true"
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"
secrets:
  session_signing_key:
    file: ./keys/web/session_signing_key
  authorized_worker_keys:
    file: ./keys/web/authorized_worker_keys
  tsa_host_key:
    file: ./keys/web/tsa_host_key
  tsa_host_key.pub:
    file: ./keys/web/tsa_host_key.pub
  worker_key:
    file: ./keys/worker/worker_key
  worker_key.pub:
    file: ./keys/worker/worker_key.pub
networks:
  concourse:
    driver: overlay
It seems that the invalid argument error is related to #42, but the workaround mentioned there (mounting a volume at /worker-state) does not work for me.
Update: The message that describes the actual error is produced by the kernel, and it can be viewed with journalctl -f:
Jun 24 16:28:48 swarm-2 kernel: overlayfs: idmapped layers are currently not supported
It's said that the support for idmapped layers in overlayfs will be available in Linux 5.19 (current mainline kernel is 5.18).
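If that holds, a quick check of the running kernel tells whether a node is new enough. A minimal sketch, taking the 5.19 threshold from the claim above as an assumption rather than something verified here; kernel_at_least and current_kernel_at_least are hypothetical helper names.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string such as '5.18.0-2-amd64' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def current_kernel_at_least(major, minor):
    # On Linux, platform.release() returns the same string as `uname -r`.
    return kernel_at_least(platform.release(), major, minor)
```

Calling current_kernel_at_least(5, 19) on each worker node would show which ones are still on an older kernel.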