error using DIND_DAEMON_JSON_FILE
TrentonAdams opened this issue · 9 comments
I don't think I'm using this incorrectly. My reason for setting a daemon.json was to have insecure registries. However, I now know I can set a separate variable for that...
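For reference, the daemon.json only needs to enable the insecure registries; this is an illustrative sketch rather than the exact file from my setup:

```sh
# Illustrative only: a minimal daemon.json that treats every registry as insecure.
cat > daemon.json <<'EOF'
{
  "insecure-registries": ["0.0.0.0/0"]
}
EOF
```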
$ export DIND_DAEMON_JSON_FILE="$(pwd)/daemon.json"
$ dind-cluster-v1.13.sh up
WARNING: No swap limit support
WARNING: No swap limit support
WARNING: No swap limit support
WARNING: No swap limit support
* Making sure DIND image is up to date
sha256:0fcb655948a1fa20f5a2100983755edc8f0d763248bda217b3454d82d5cd3be4: Pulling from mirantis/kubeadm-dind-cluster
Digest: sha256:0fcb655948a1fa20f5a2100983755edc8f0d763248bda217b3454d82d5cd3be4
Status: Image is up to date for mirantis/kubeadm-dind-cluster@sha256:0fcb655948a1fa20f5a2100983755edc8f0d763248bda217b3454d82d5cd3be4
* Starting DIND container: kube-master
Job for docker.service canceled.
Anyhow, I narrowed it down to this line in `dind-cluster-v1.13.sh` (line 2277):

```sh
docker exec ${container_id} systemctl restart docker
```
The same thing happens when you use:

```sh
export DIND_INSECURE_REGISTRIES="[\"0.0.0.0/0\"]"
```

To work around it, I had to remove the `docker exec ${container_id} systemctl restart docker` line in `dind::custom-docker-opts`. Then, after the cluster starts up, I run `dind-cluster-v1.13.sh down; dind-cluster-v1.13.sh up`. Essentially:

```sh
export DIND_INSECURE_REGISTRIES="[\"0.0.0.0/0\"]"
dind-cluster-v1.13.sh up && dind-cluster-v1.13.sh down && dind-cluster-v1.13.sh up
```
Hey, I got exactly the same problem and almost the same workaround! :-)
Still have no idea why it fails. I can recreate the issue and see the error when I go into the container and run `systemctl restart docker` manually.
Hey @TrentonAdams, after a few hours of struggling, I think I've got an idea why! CC @jc-sanchez, who previously fixed a similar issue, and @ivan4th, who seems to be the most active contributor to this repo in recent months :-)
Actually, the issue should have been fixed months ago: someone reported a similar issue in #266, and the corresponding PR #271 has already been merged! This comment explains most of the root cause.
However, there are a couple of problems:

1. The fix has never been included in the public images! It should go into `mirantis/kubeadm-dind-cluster:bare-v4` and then be inherited by `mirantis/kubeadm-dind-cluster:<commit>-v1.xx`. But I checked the latest image on Docker Hub, `mirantis/kubeadm-dind-cluster:dd4966877e3a421238a538a525172c4162b7554d-v1.13`, which was pushed a few days ago, and it's still the old code: I can still see `mkdir -p /dind/containerd` in `wrapkubeadm`, while it's supposed to be moved to the `Dockerfile` per the fix. I don't know how that image gets pushed to Docker Hub, because I cannot find the bare-v4 image there; probably it's built locally or somewhere internal? Anyway, it seems the bare-v4 image needs to be updated.
2. Moving `mkdir -p /dind/containerd` out of `wrapkubeadm` is not sufficient, because there's another error before that when restarting docker.service:

   ```
   Mar 10 02:11:47 kube-master modprobe[57]: modprobe: FATAL: Module overlay not found in directory /lib/modules/4.9.125-linuxkit
   ```

   This is because the `4.9.125-linuxkit` modules are missing, which can be fixed by running `tar -C / -xf /dind-sys/sys.tar` in the container. You can find this in the `wrapkubeadm` code. A quick way to check this manually is sketched below.
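For example, this is roughly how the missing-modules problem shows up and goes away inside the node container (a sketch based on the container name and paths mentioned above, not commands taken verbatim from the thread):

```sh
# Sketch: reproduce and fix the missing kernel modules inside the kube-master node container.
docker exec kube-master ls /lib/modules/               # 4.9.125-linuxkit is absent before the fix
docker exec kube-master modprobe overlay               # fails: "Module overlay not found ..."

# Restore the modules from the bundled sys.tar; after this, modprobe succeeds
# and restarting docker no longer trips over the missing overlay module.
docker exec kube-master tar -C / -xf /dind-sys/sys.tar
docker exec kube-master modprobe overlay
```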
So, as a workaround before this gets fixed officially: after I added the following lines in `dind::custom-docker-opts`, the issue is gone!
```sh
function dind::custom-docker-opts {
  ...
  if [[ ${got_changes} ]] ; then
    ...
    # work around: prereqs before restarting the docker service
    docker exec ${container_id} tar -C / -xf /dind-sys/sys.tar
    docker exec ${container_id} mkdir -p /dind/containerd
    docker exec ${container_id} systemctl daemon-reload
    docker exec ${container_id} systemctl restart docker
  fi
}
```
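With that change in place, a single `up` should be enough; the up/down/up dance from the first comment should no longer be needed (assuming the patched script is the one you invoke):

```sh
# Assuming the patched dind-cluster-v1.13.sh is the script being run:
export DIND_INSECURE_REGISTRIES="[\"0.0.0.0/0\"]"
./dind-cluster-v1.13.sh up
```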
More updates: I've now figured out that this issue can be completely resolved without any workaround changes :-)
To my understanding, the CI keeps pushing images tagged with the corresponding git commit hash to Docker Hub. So when I use the latest one available at the time of writing this comment, tag `dd4966877e3a421238a538a525172c4162b7554d-v1.1x`, it should include @jc-sanchez's fix. I've verified locally that it works perfectly... no need to untar `/dind-sys/sys.tar` anymore.
Just one thing remains: both the pre-configured scripts in the `fixed` folder and those from the GitHub release are a bit old. The ones in `fixed` are said to be deprecated, but the latest release is also 3 months old. The only reasonable resolution for me, before those scripts get updated, is to run `build/genfixed.sh` locally to re-generate the pre-configured scripts from the master branch, which will include the fix. But this approach is not documented.
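For anyone who wants to try that route, the steps look roughly like this (a sketch; the repo URL comes from the release link later in this thread, and the output location of `genfixed.sh` is assumed from its description above):

```sh
# Rough sketch: re-generate the pre-configured scripts from the master branch.
git clone https://github.com/kubernetes-sigs/kubeadm-dind-cluster.git
cd kubeadm-dind-cluster
./build/genfixed.sh              # assumed to regenerate the scripts under fixed/
./fixed/dind-cluster-v1.13.sh up
```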
With that, I am curious when we can bump the release or the `fixed` scripts, or whether we could add instructions for running `genfixed.sh` to README.md. I'm seeing quite a few people reporting the same issue, so I believe that would save people a lot of time! And, if needed, I'd be happy to help with that, since I've already spent quite a few hours on it :-) Thx @pigmej @jc-sanchez @ivan4th
> So, as a workaround before this gets fixed officially: after I added the following lines in `dind::custom-docker-opts`, the issue is gone! [...]
Cannot get this to work with the most recent dind-cluster-v1.13.sh freshly wget'ed from this repo. The offending part, after my changes (unfortunately the description above is a bit ambiguous), looks like this for me:
```sh
if [[ ${got_changes} ]] ; then
  local json=$(IFS="+"; echo "${jq[*]}")
  docker exec -i ${container_id} /bin/sh -c "mkdir -p /etc/docker && jq -n '${json}' > /etc/docker/daemon.json"
  docker exec ${container_id} tar -C / -xf /dind-sys/sys.tar
  docker exec ${container_id} mkdir -p /dind/containerd
  docker exec ${container_id} systemctl daemon-reload
  docker exec ${container_id} systemctl restart docker
fi
```
Still getting "Job for docker.service canceled."
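One way to dig further (a suggestion, not something from the original thread) is to ask systemd inside the node container why the docker unit restart was cancelled:

```sh
# Inspect the docker unit inside the node container for the cancellation reason.
docker exec kube-master systemctl status docker
docker exec kube-master journalctl -u docker --no-pager | tail -n 50
```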
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
I think this has been fixed with the newly released pre-configured scripts, e.g. https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.2.0/dind-cluster-v1.14.sh
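For anyone landing here with the same problem, using the released script looks roughly like this (the insecure-registries variable is the one from earlier in this thread):

```sh
# Fetch the released script and bring the cluster up with insecure registries enabled.
wget https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.2.0/dind-cluster-v1.14.sh
chmod +x dind-cluster-v1.14.sh
export DIND_INSECURE_REGISTRIES="[\"0.0.0.0/0\"]"
./dind-cluster-v1.14.sh up
```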
/close
@morningspace: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.