CoreDNS is broken on some installations
yhaliaw opened this issue · 6 comments
Summary
CoreDNS does not always work. In our experience, after MicroK8s is installed it sometimes works and sometimes does not.
When it is broken, nslookup against CoreDNS fails:
$ nslookup api.charmhub.io 10.152.183.10
;; Got SERVFAIL reply from 10.152.183.10
Server: 10.152.183.10
Address: 10.152.183.10#53
** server can't find api.charmhub.io: SERVFAIL
The CoreDNS logs contain the following:
[INFO] 127.0.0.1:37190 - 37245 "HINFO IN 4196468728056053778.1290062554275457168. udp 57 false 512" - - 0 6.002020584s
[ERROR] plugin/errors: 2 4196468728056053778.1290062554275457168. HINFO: read udp 10.1.53.2:40938->{PRIVATE-DNS-IP}:53: i/o timeout
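These logs can be retrieved with the standard kubectl logs command (shown here for reference; this command is not from the original report):
microk8s kubectl logs --namespace kube-system deployment/coredns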
The faulty CoreDNS issue persists: every request fails. Running nslookup directly against the private DNS server on the same machine always succeeds.
Restarting the CoreDNS pod with kubectl rollout restart -n kube-system deployment/coredns does not resolve the issue. Re-enabling the MicroK8s dns addon does not resolve it either.
However, reinstalling MicroK8s on the same machine does resolve the issue:
sudo snap remove microk8s --purge
sudo snap install microk8s --channel=1.31-strict/stable
sudo microk8s enable dns
What Should Happen Instead?
nslookup against CoreDNS (10.152.183.10) should always succeed. It should not be broken in some installations.
Reproduction Steps
We are running jobs on self-hosted runners. Across multiple runs of this workflow, the resulting MicroK8s installation sometimes ends up with a faulty CoreDNS.
The part of the workflow with the MicroK8s-related commands:
- timeout-minutes: 5
  run: |
    sudo apt-get update
    sudo apt-get install retry -y
    sudo snap install microk8s --channel='1.31-strict/stable'
    sudo adduser "$USER" 'snap_microk8s'
- name: (IS hosted) Configure microk8s Docker Hub mirror
  timeout-minutes: 5
  run: |
    sudo tee /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml << EOF
    server = "$DOCKERHUB_MIRROR"
    [host."${DOCKERHUB_MIRROR#'https://'}"]
    capabilities = ["pull", "resolve"]
    EOF
    sudo microk8s stop
    sudo microk8s start
- name: Set up microk8s
  id: microk8s-setup
  timeout-minutes: 15
  run: |
    # `newgrp` does not work in GitHub Actions; use `sg` instead
    sg 'snap_microk8s' -c "microk8s status --wait-ready"
    sg 'snap_microk8s' -c "retry --times 3 --delay 5 -- sudo microk8s enable dns"
    sg 'snap_microk8s' -c "microk8s status --wait-ready"
    sg 'snap_microk8s' -c "microk8s.kubectl rollout status --namespace kube-system --watch --timeout=5m deployments/coredns"
    sg 'snap_microk8s' -c "retry --times 3 --delay 5 -- sudo microk8s enable hostpath-storage"
    sg 'snap_microk8s' -c "microk8s.kubectl rollout status --namespace kube-system --watch --timeout=5m deployments/hostpath-provisioner"
    mkdir ~/.kube/
    # Used by lightkube and kubernetes (Python package)
    sg 'snap_microk8s' -c "microk8s config > ~/.kube/config"
- run: sudo snap install juju
- run: sudo usermod -a -G snap_microk8s ubuntu
- run: sg 'snap_microk8s' -c "juju bootstrap microk8s"
- run: juju add-model test
The juju installation and setup might not be relevant to this bug.
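As a side note on the mirror step above: the ${DOCKERHUB_MIRROR#'https://'} expansion strips the URL scheme so the TOML host key is a bare hostname. A minimal bash illustration, using a hypothetical mirror URL (not a value from this report):
# Hypothetical value, for illustration only
DOCKERHUB_MIRROR='https://mirror.example.com'
echo "${DOCKERHUB_MIRROR#'https://'}"  # prints: mirror.example.com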
Once we SSH into a runner whose MicroK8s has the failure, nslookup api.charmhub.io 10.152.183.10 fails every time, and reinstalling MicroK8s resolves the issue.
Introspection Report
Can you suggest a fix?
Are you interested in contributing with a fix?
Hey @yhaliaw,
Did you have a chance to check if DNS queries for services or pods also fail in the "broken" installation? Can you also try setting the upstream forward server(s) manually with microk8s enable dns:<comma-separated-list-of-ips>?
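For concreteness, one way to exercise both suggestions could look like the following sketch (the busybox image and the Google resolver IPs are illustrative placeholders, not values from this thread):
# Query a cluster service name from a throwaway pod
microk8s kubectl run dnstest --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local
# Re-enable the addon with explicit upstream forwarders
microk8s enable dns:8.8.8.8,8.8.4.4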
Did you have a chance to check if dns queries for services or pods also fail in the "broken" installation?
No, I have not tried this.
Can you also try setting the upstream forward server(s) manually with microk8s enable dns:<comma-separated-list-of-ips>?
Which upstream forward server should I try?
For the broken installation, the CoreDNS logs show:
[INFO] 127.0.0.1:37190 - 37245 "HINFO IN 4196468728056053778.1290062554275457168. udp 57 false 512" - - 0 6.002020584s
[ERROR] plugin/errors: 2 4196468728056053778.1290062554275457168. HINFO: read udp 10.1.53.2:40938->{PRIVATE-DNS-IP}:53: i/o timeout
The {PRIVATE-DNS-IP} is the correct DNS server used by the VM. Manually querying that DNS server with nslookup succeeds every time, while nslookup against CoreDNS of the broken installation fails every time with the error message above.
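For reference, the upstream that CoreDNS forwards to can be inspected in the addon's Corefile via its ConfigMap (standard kubectl; the stanza in the comments is only the typical shape, not taken from this report):
microk8s kubectl get configmap coredns --namespace kube-system -o yaml
# The Corefile typically contains a forward stanza such as
#   forward . /etc/resolv.conf
# or an explicit list of upstream IPs when set via `microk8s enable dns:<ips>`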
The data platform team also waits for the rollout after enabling the DNS addon:
sudo microk8s enable dns
microk8s status --wait-ready
microk8s.kubectl rollout status --namespace kube-system --watch --timeout=5m deployments/coredns
Did you have a chance to check if dns queries for services or pods also fail in the "broken" installation?
Do you have a command I can run for this, or some instructions? Which services or pods should I use?
Hi @yhaliaw,
the issue behind the flaky tests is missing iptables entries that are necessary for the MicroK8s pods to communicate successfully. The flakes seem to occur when MicroK8s is installed and stopped before it is ready, so the iptables entries have not yet been populated. Adding sudo microk8s status --wait-ready after the installation removes the flake for me.
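Applied to the workflow above, the wait would go right after the snap install, before the stop/start in the mirror-configuration step (a sketch, not a verbatim patch from this thread):
sudo snap install microk8s --channel='1.31-strict/stable'
# Wait until MicroK8s is ready, so the iptables entries are populated
# before the later `microk8s stop` / `microk8s start`
sudo microk8s status --wait-ready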
Wishing you a happy CI from now on!
Louise
I have tested your recommendation. It seems to work fine, so I am closing this issue.