CoreDNS is broken on some installations
yhaliaw opened this issue · 6 comments
Summary
CoreDNS does not always work. In our experience, after MicroK8s is installed it sometimes works and sometimes does not.
When it is broken, nslookup against CoreDNS fails:
$ nslookup api.charmhub.io 10.152.183.10
;; Got SERVFAIL reply from 10.152.183.10
Server: 10.152.183.10
Address: 10.152.183.10#53
** server can't find api.charmhub.io: SERVFAIL
The CoreDNS logs contain the following:
[INFO] 127.0.0.1:37190 - 37245 "HINFO IN 4196468728056053778.1290062554275457168. udp 57 false 512" - - 0 6.002020584s
[ERROR] plugin/errors: 2 4196468728056053778.1290062554275457168. HINFO: read udp 10.1.53.2:40938->{PRIVATE-DNS-IP}:53: i/o timeout
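These logs can be retrieved with the standard kubectl logs command (shown here for reference; this command is not from the original report):
microk8s kubectl logs --namespace kube-system deployment/coredns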
The faulty CoreDNS issue persists: every request fails. Running nslookup directly against the private DNS server on the same machine always succeeds.
Restarting the CoreDNS pod with kubectl rollout restart -n kube-system deployment/coredns does not resolve the issue. Re-enabling the MicroK8s dns addon does not resolve it either.
However, reinstalling MicroK8s on the same machine does resolve the issue:
sudo snap remove microk8s --purge
sudo snap install microk8s --channel=1.31-strict/stable
sudo microk8s enable dns
What Should Happen Instead?
nslookup against CoreDNS (10.152.183.10) should always succeed. It should not be broken in some installations.
Reproduction Steps
We are running jobs on self-hosted runners. Across multiple runs of this workflow, the resulting MicroK8s installation sometimes ends up with a faulty CoreDNS.
The part of the workflow with the MicroK8s-related commands:
- timeout-minutes: 5
  run: |
    sudo apt-get update
    sudo apt-get install retry -y
    sudo snap install microk8s --channel='1.31-strict/stable'
    sudo adduser "$USER" 'snap_microk8s'
- name: (IS hosted) Configure microk8s Docker Hub mirror
  timeout-minutes: 5
  run: |
    sudo tee /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml << EOF
    server = "$DOCKERHUB_MIRROR"
    [host."${DOCKERHUB_MIRROR#'https://'}"]
    capabilities = ["pull", "resolve"]
    EOF
    sudo microk8s stop
    sudo microk8s start
- name: Set up microk8s
  id: microk8s-setup
  timeout-minutes: 15
  run: |
    # `newgrp` does not work in GitHub Actions; use `sg` instead
    sg 'snap_microk8s' -c "microk8s status --wait-ready"
    sg 'snap_microk8s' -c "retry --times 3 --delay 5 -- sudo microk8s enable dns"
    sg 'snap_microk8s' -c "microk8s status --wait-ready"
    sg 'snap_microk8s' -c "microk8s.kubectl rollout status --namespace kube-system --watch --timeout=5m deployments/coredns"
    sg 'snap_microk8s' -c "retry --times 3 --delay 5 -- sudo microk8s enable hostpath-storage"
    sg 'snap_microk8s' -c "microk8s.kubectl rollout status --namespace kube-system --watch --timeout=5m deployments/hostpath-provisioner"
    mkdir ~/.kube/
    # Used by lightkube and kubernetes (Python package)
    sg 'snap_microk8s' -c "microk8s config > ~/.kube/config"
- run: sudo snap install juju
- run: sudo usermod -a -G snap_microk8s ubuntu
- run: sg 'snap_microk8s' -c "juju bootstrap microk8s"
- run: juju add-model test
The juju installation and setup might not be relevant to this bug.
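As a side note on the mirror step above: the ${DOCKERHUB_MIRROR#'https://'} expansion strips the URL scheme so the TOML host key is a bare hostname. A minimal bash illustration, using a hypothetical mirror URL (not a value from this report):
# Hypothetical value, for illustration only
DOCKERHUB_MIRROR='https://mirror.example.com'
echo "${DOCKERHUB_MIRROR#'https://'}"  # prints: mirror.example.com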
Once we SSH into a runner whose MicroK8s has the failure, nslookup api.charmhub.io 10.152.183.10 fails every time, and reinstalling MicroK8s resolves the issue.
Introspection Report
Can you suggest a fix?
Are you interested in contributing with a fix?
Hey @yhaliaw,
Did you have a chance to check if DNS queries for services or pods also fail in the "broken" installation? Can you also try setting the upstream forward server(s) manually with microk8s enable dns:<comma-separated-list-of-ips>?
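For concreteness, one way to exercise both suggestions could look like the following sketch (the busybox image and the Google resolver IPs are illustrative placeholders, not values from this thread):
# Query a cluster service name from a throwaway pod
microk8s kubectl run dnstest --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local
# Re-enable the addon with explicit upstream forwarders
microk8s enable dns:8.8.8.8,8.8.4.4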
Did you have a chance to check if dns queries for services or pods also fail in the "broken" installation?
No, I have not tried this.
Can you also try setting the upstream forward server(s) manually with microk8s enable dns:<comma-separated-list-of-ips>?
Which upstream forward server should I try?
For the broken installation, the CoreDNS logs show:
[INFO] 127.0.0.1:37190 - 37245 "HINFO IN 4196468728056053778.1290062554275457168. udp 57 false 512" - - 0 6.002020584s
[ERROR] plugin/errors: 2 4196468728056053778.1290062554275457168. HINFO: read udp 10.1.53.2:40938->{PRIVATE-DNS-IP}:53: i/o timeout
The {PRIVATE-DNS-IP} is the correct DNS server used by the VM. Manually querying that DNS server with nslookup succeeds every time, while nslookup against CoreDNS of the broken installation fails every time with the error message above.
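For reference, the upstream that CoreDNS forwards to can be inspected in the addon's Corefile via its ConfigMap (standard kubectl; the stanza in the comments is only the typical shape, not taken from this report):
microk8s kubectl get configmap coredns --namespace kube-system -o yaml
# The Corefile typically contains a forward stanza such as
#   forward . /etc/resolv.conf
# or an explicit list of upstream IPs when set via `microk8s enable dns:<ips>`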
The data platform team also waits for the rollout after enabling the DNS addon:
sudo microk8s enable dns
microk8s status --wait-ready
microk8s.kubectl rollout status --namespace kube-system --watch --timeout=5m deployments/coredns
Did you have a chance to check if dns queries for services or pods also fail in the "broken" installation?
Do you have a command I can run for this, or some instructions? Which services or pods should I use?
Hi @yhaliaw,
the issue behind the flaky tests is missing iptables entries that are necessary for the MicroK8s pods to communicate successfully. The flakes seem to occur when MicroK8s is installed and stopped before it is ready, so the iptables entries have not yet been populated. Adding sudo microk8s status --wait-ready after the installation removes the flake for me.
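Applied to the workflow above, the wait would go right after the snap install, before the stop/start in the mirror-configuration step (a sketch, not a verbatim patch from this thread):
sudo snap install microk8s --channel='1.31-strict/stable'
# Wait until MicroK8s is ready, so the iptables entries are populated
# before the later `microk8s stop` / `microk8s start`
sudo microk8s status --wait-ready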
Wishing you a happy CI from now on!
Louise
I have tested your recommendation. It seems to work fine, so I am closing this issue.