kubernetes/dns

Upgrade to coreDNS 1.5 in node-cache

prameshj opened this issue · 21 comments

This is needed to pick up a bugfix: coredns/coredns#2636

@prameshj can you test if our new image has the same problems with the cleanup that caused the revert:

pasientskyhosting/k8s-dns-node-cache-amd64:1.15.4-2-g903339e-coredns1.6.2-dirty

It still does not do the cleanup. This is how I test it:

  1. Pick one of the node-local-dns pods and run "watch -n 1 'iptables -t raw -nvL'"
    This command watches the iptables rules that node-local-dns pod configures.
  2. Delete that specific node-local-dns pod with "kubectl delete pods "
  3. Check the output from step 1 to see whether the rules are still there.

I see that the rules are still there and only get cleaned up when the next pod comes up. This is problematic when we disable nodelocaldns, since the rules/interface are never cleaned up.
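
For reference, this is roughly what the cleanup is supposed to do on shutdown: remove the NOTRACK rules from the raw table and delete the dummy interface. A minimal sketch (the helper name, the interface name "nodelocaldns" and the listen IP are assumptions for illustration, not the actual node-cache code):

    // cleanup_sketch.go: illustrative only, not the actual node-cache teardown code.
    package main

    import (
        "log"
        "os/exec"
    )

    // teardown removes the rules and interface that node-local-dns sets up.
    func teardown(localIP string) {
        // Delete the NOTRACK (-j CT --notrack) rules from the raw PREROUTING chain.
        for _, proto := range []string{"udp", "tcp"} {
            args := []string{"-t", "raw", "-D", "PREROUTING", "-p", proto,
                "-d", localIP, "--dport", "53", "-j", "CT", "--notrack"}
            if err := exec.Command("iptables", args...).Run(); err != nil {
                log.Printf("failed to delete %s rule: %v", proto, err)
            }
        }
        // Remove the dummy interface that holds the listen IP.
        if err := exec.Command("ip", "link", "del", "nodelocaldns").Run(); err != nil {
            log.Printf("failed to delete dummy interface: %v", err)
        }
    }

    func main() {
        teardown("169.254.25.10")
    }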

Last I checked, it looked like the shutdown callbacks in the forward plugin take too long or do not complete before the pod is shut down. The node-cache cleanup would only be called after all the plugin shutdown callbacks. @johnbelamaric any ideas on how to debug this further?

Ok. I tested it with the image and the PREROUTING rules are not deleted, as you said:

Chain PREROUTING (policy ACCEPT 4239 packets, 952K bytes)
    4   374 CT         udp  --  *      *       0.0.0.0/0            169.254.25.10        udp dpt:53 NOTRACK
    0     0 CT         tcp  --  *      *       0.0.0.0/0            169.254.25.10        tcp dpt:53 NOTRACK

I will try to debug a little bit more.

Last I had checked, it looked like the shutdown callbacks in the forward plugin take too long or do not complete before the pod is shut down

@prameshj How did you debug this part? I'm trying to see the logs when the pod is deleted, but I can't access them with the --previous flag.

Thank you

I think I added logs to the different plugin shutdown callbacks and checked whether they were invoked.
I started with logs here:

exitCode := executeShutdownCallbacks("SIGTERM")

then checked which callbacks completed; I recall the forward plugin callback did not complete.
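
Roughly, the instrumentation looked like this: a sketch against the caddy Controller API with an illustrative wrapper (instrumentShutdown is a made-up name, not the exact diff I used):

    // instrument_sketch.go: illustrative only.
    package plugindebug

    import (
        "log"
        "time"

        "github.com/caddyserver/caddy"
    )

    // instrumentShutdown wraps a plugin's shutdown hook with log lines so the pod
    // logs show whether the callback ever runs on SIGTERM and how long it takes.
    func instrumentShutdown(c *caddy.Controller, name string, shutdown func() error) {
        c.OnShutdown(func() error {
            log.Printf("%s: shutdown callback starting", name)
            start := time.Now()
            err := shutdown()
            log.Printf("%s: shutdown callback done in %v, err=%v", name, time.Since(start), err)
            return err
        })
    }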

I also modified the nodelocaldns yaml to include a terminationGracePeriodSeconds value of 60, but that did not help.

@jordips Is it possible for you to build this image with the latest CoreDNS - 1.6.3? https://github.com/coredns/coredns/releases

I talked to @johnbelamaric; he suggested we try the latest version, and if it still shows the same behavior, we can open an issue on the coredns repo.

@prameshj sure! I will try it later and let you know.

@woopstar or @prameshj Is there any special procedure for managing packages with dep? I changed the CoreDNS version in Gopkg.toml, and when I run dep ensure it fails with missing packages from k8s.io (the ones inside "pkg" and "cmd").
Thanks

@chad-jones is the one who made our image and got it working with 1.6.2. I can ask him to rebuild with 1.6.3 too

@woopstar that would be awesome! :) Or if @chad-jones explains the rebuild steps, I can do it myself. I think the procedure is:

  • change Gopkg.toml (CoreDNS version)
  • run "dep ensure" or something like it to pull the new source into the vendor folder.
  • run "make containers" to build the new image.

But I am having problems with the dep dependencies... I'm still working through it.

@chad-jones mentioned to me that it was a dep nightmare when he did 1.6.2 :) I'll let him post here whatever he decides to do :)

@woopstar perfect. Thank you for the info. Good to know that I'm not the only one who thinks it's a nightmare 👍 :D

CoreDNS - 1.6.3
pasientskyhosting/k8s-dns-node-cache-amd64:1.15.4-coredns1.6.3-dirty

k/k and CoreDNS have both moved to Go modules; is it worth making that change here?

I think the biggest issue is the prometheus client version that skydns uses versus what CoreDNS requires. When I tried updating to a newer CoreDNS, I had to manually edit the skydns vendor directory to use a newer prometheus client. Was this the issue you faced too?

I am not sure if moving to Go modules will fix that particular issue, but it is worth making the change.

I will try out the cleanup test with the new image and report back in a day or so. Thanks for sharing the image.

axot commented

I also tried upgrading to 1.5.3 before and did not face the skydns issue. Another change, I think, was that caddy was renamed from github.com/mholt/caddy to github.com/caddyserver/caddy.

pasientskyhosting/k8s-dns-node-cache-amd64:1.15.4-coredns1.6.3-dirty

I just tried this. The cleanup code is not invoked in the latest version either.

@woopstar @chad-jones I am trying to debug the cleanup issue by adding logs and building a custom image. I am having a hard time getting the image to build.

I changed Gopkg.toml to point to coredns 1.6.4 and client_golang to 1.1.0, which coredns needs. However, that dependency does not resolve when I do "dep ensure". Could you submit a PR with the changes to upgrade to 1.6? I can debug it further. Thanks!

I found the issue. CoreDNS changed the import path for caddy in coredns/coredns#2961.
node-cache was still using the old import path when registering the cleanup function as part of OnProcessExit. I changed it to use the right path and it is working fine now.
Many thanks to @chad-jones for creating the PR with the dependency updates. I have submitted a PR with the changes and hope to merge it soon.
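
In other words, with two caddy copies vendored under different import paths, node-cache was appending its cleanup to the OnProcessExit list of the old package, while the running CoreDNS core only executes the callbacks registered on the new one. A sketch of the idea (registerCleanup is an illustrative name, not the literal patch):

    // register_sketch.go: illustrative only, not the literal node-cache change.
    package nodecachefix

    // Before: import "github.com/mholt/caddy" (old path; the vendored CoreDNS 1.6.x
    // no longer uses it, so callbacks appended to its OnProcessExit list never run).
    // After: import the same caddy package that the vendored CoreDNS imports.
    import "github.com/caddyserver/caddy"

    // registerCleanup appends the teardown function to caddy's process-exit hooks,
    // on the caddy package that is actually driving the running server.
    func registerCleanup(teardown func()) {
        caddy.OnProcessExit = append(caddy.OnProcessExit, teardown)
    }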

We are glad to help out with this issue! Thank you @prameshj for working out an official solution to this.