kubernetes-retired/kubeadm-dind-cluster

error using DIND_DAEMON_JSON_FILE

TrentonAdams opened this issue · 9 comments

I don't think I'm using this incorrectly. My reason for setting a daemon.json was to have insecure registries. However, I now know I can set a separate variable for that...
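For reference, a daemon.json enabling insecure registries looks roughly like the sketch below (the CIDR is illustrative, not necessarily the exact file used in this report):

# illustrative daemon.json allowing pulls from insecure registries
cat > daemon.json <<'EOF'
{
  "insecure-registries": ["0.0.0.0/0"]
}
EOF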

$ export DIND_DAEMON_JSON_FILE="$(pwd)/daemon.json"
$ dind-cluster-v1.13.sh up
WARNING: No swap limit support
WARNING: No swap limit support
WARNING: No swap limit support
WARNING: No swap limit support
* Making sure DIND image is up to date 
sha256:0fcb655948a1fa20f5a2100983755edc8f0d763248bda217b3454d82d5cd3be4: Pulling from mirantis/kubeadm-dind-cluster
Digest: sha256:0fcb655948a1fa20f5a2100983755edc8f0d763248bda217b3454d82d5cd3be4
Status: Image is up to date for mirantis/kubeadm-dind-cluster@sha256:0fcb655948a1fa20f5a2100983755edc8f0d763248bda217b3454d82d5cd3be4
* Starting DIND container: kube-master
Job for docker.service canceled.

Anyhow, I narrowed it down to line 2277 of dind-cluster-v1.13.sh:

docker exec ${container_id} systemctl restart docker

The same thing happens when you use...

export DIND_INSECURE_REGISTRIES="[\"0.0.0.0/0\"]"

To work around it, I had to remove the docker exec ${container_id} systemctl restart docker line in dind::custom-docker-opts. Then, after the cluster starts up I use dind-cluster-v1.13.sh down; dind-cluster-v1.13.sh up. Essentially...

export DIND_INSECURE_REGISTRIES="[\"0.0.0.0/0\"]"
dind-cluster-v1.13.sh up && dind-cluster-v1.13.sh down && dind-cluster-v1.13.sh up

Hey, I got exactly the same problem and almost the same workaround! :-)

Still have no idea why it fails. I can recreate the issue and see the error when I go into the container and run systemctl restart docker manually.
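For anyone wanting to reproduce this by hand, something like the following should show the same failure (kube-master is the container name from the log above; journalctl is assumed to be available in the systemd-based DIND image):

# reproduce the failing restart inside the master DIND container
docker exec kube-master systemctl restart docker    # fails with "Job for docker.service canceled."
# then inspect the docker unit's journal for the underlying error
docker exec kube-master journalctl -u docker.service --no-pager | tail -n 50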

Hey @TrentonAdams, after a few hours of struggling, I think I've figured out why! CC @jc-sanchez, who previously fixed a similar issue, and @ivan4th, who seems to be the most active contributor to this repo in recent months :-)

Actually, the issue should have been fixed months ago: someone reported a similar issue, #266, and the corresponding PR #271 has already been merged! This comment mostly explains the root cause.

However, there are a couple of problems:

  • The fix has never been included in the public images! It should go into mirantis/kubeadm-dind-cluster:bare-v4 and then be inherited by mirantis/kubeadm-dind-cluster:<commit>-v1.xx. But I checked the latest image on Docker Hub, mirantis/kubeadm-dind-cluster:dd4966877e3a421238a538a525172c4162b7554d-v1.13, which was pushed a few days ago, and it is still the old code: I can still see mkdir -p /dind/containerd in wrapkubeadm, even though the fix is supposed to move it into the Dockerfile (a quick way to check is sketched after this list). I don't know how that image is getting pushed to Docker Hub, because I cannot find the bare-v4 image there; perhaps it's built locally or somewhere internal? In any case, it seems the bare-v4 image needs to be updated.

  • Moving mkdir -p /dind/containerd out of wrapkubeadm does not seem sufficient, because restarting docker.service hits another error before that: Mar 10 02:11:47 kube-master modprobe[57]: modprobe: FATAL: Module overlay not found in directory /lib/modules/4.9.125-linuxkit. This is caused by the missing /lib/modules/4.9.125-linuxkit directory, which can be fixed by running tar -C / -xf /dind-sys/sys.tar in the container; that command can be found in the wrapkubeadm code.
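As a quick way to verify which code a published tag contains (a sketch; the paths searched are an assumption about where wrapkubeadm lives inside the image):

# look for the old "mkdir -p /dind/containerd" line inside the published image
docker run --rm --entrypoint /bin/sh \
  mirantis/kubeadm-dind-cluster:dd4966877e3a421238a538a525172c4162b7554d-v1.13 \
  -c 'grep -Rln "mkdir -p /dind/containerd" /usr/local/bin /usr/bin 2>/dev/null || echo "string not found: fix may be present"'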

So, as a workaround until it's fixed officially: after I added the following lines to dind::custom-docker-opts, the issue is gone!

function dind::custom-docker-opts {
  ...
  if [[ ${got_changes} ]] ; then
    ...
    # work around: prereqs before restart docker service
    docker exec ${container_id} tar -C / -xf /dind-sys/sys.tar
    docker exec ${container_id} mkdir -p /dind/containerd
    docker exec ${container_id} systemctl daemon-reload
    docker exec ${container_id} systemctl restart docker
  fi
}

More updates: I've now figured out that this issue can be completely resolved without any workaround changes :-)

To my understanding, the CI keeps pushing images tagged with the corresponding git commit hash to Docker Hub. So the latest one at the time of writing this comment, tag dd4966877e3a421238a538a525172c4162b7554d-v1.1x, should include @jc-sanchez's fix. I've verified locally that it works perfectly... no need to untar /dind-sys/sys.tar anymore.
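For example, assuming the script honours a DIND_IMAGE override (as the pre-configured scripts appear to), pointing it at that CI-built tag would look roughly like this:

# assumption: DIND_IMAGE overrides the image baked into the pre-configured script
export DIND_IMAGE=mirantis/kubeadm-dind-cluster:dd4966877e3a421238a538a525172c4162b7554d-v1.13
./dind-cluster-v1.13.sh up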

Just one thing remains: both the pre-configured scripts in the fixed folder and those from the GitHub release are a bit old. The ones in fixed are said to be deprecated, but the latest release is also 3 months old. Until those scripts are updated, the only reasonable option I see is to run build/genfixed.sh locally on the master branch to re-generate the pre-configured scripts, which will include the fix. But this approach does not appear to be documented.
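A rough sketch of that undocumented approach, using the script and folder names mentioned above:

# regenerate the pre-configured scripts from master so they pick up the fix
git clone https://github.com/kubernetes-sigs/kubeadm-dind-cluster.git
cd kubeadm-dind-cluster
./build/genfixed.sh                # regenerates the scripts under fixed/
./fixed/dind-cluster-v1.13.sh up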

With that, I am curious when we can bump the release or the fixed scripts, or whether we could document how to run genfixed.sh in README.md. I'm seeing quite a few people reporting the same issue, so I believe that would save people a lot of time! And, if needed, I'd be happy to help with that, since I've already spent quite a few hours on it :-) Thx @pigmej @jc-sanchez @ivan4th


Cannot get this to work with the most recent dind-cluster-v1.13.sh, freshly wget'ed from this repo. The offending part, after the changes (unfortunately described somewhat ambiguously above), looks like this for me:

  if [[ ${got_changes} ]] ; then
    local json=$(IFS="+"; echo "${jq[*]}")
    docker exec -i ${container_id} /bin/sh -c "mkdir -p /etc/docker && jq -n '${json}' > /etc/docker/daemon.json"
    docker exec ${container_id} tar -C / -xf /dind-sys/sys.tar
    docker exec ${container_id} mkdir -p /dind/containerd
    docker exec ${container_id} systemctl daemon-reload
    docker exec ${container_id} systemctl restart docker
  fi

Still getting "Job for docker.service canceled."
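A couple of checks that might narrow this down (a debugging sketch, not from the original thread; /etc/docker/daemon.json is the path written by the snippet above):

# confirm the generated daemon.json actually landed in the container
docker exec kube-master cat /etc/docker/daemon.json
# see what state the docker unit ended up in after the canceled job
docker exec kube-master systemctl status docker.service --no-pager -l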

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I think this has been fixed w/ the new released pre-configured scripts, e.g. https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.2.0/dind-cluster-v1.14.sh
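For reference, using that released script is just:

# fetch the v0.2.0 pre-configured script linked above and bring up a cluster
wget https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.2.0/dind-cluster-v1.14.sh
chmod +x dind-cluster-v1.14.sh
./dind-cluster-v1.14.sh up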

/close

@morningspace: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.