A modest proposal on solving DNS resolution/timing issues
AaronFriel opened this issue · 0 comments
Hi, I'd like to add a feature to resolve a problem I have, which seems to be shared by others. I'll learn whatever Go I need in order to implement it, and I don't think the implementation would be that difficult. Before I start, though, I'd like feedback from people who have developed against the Kubernetes APIs on whether I'm even on the right track to solve this problem.
The problem
A race condition which kube-lego almost always loses. Upon creating a new ingress/service with kube-lego and external-dns configured:
- kube-lego does a reachability test that involves a DNS lookup. The DNS lookup hits the cluster DNS, which (typically) uses the node's resolver or an upstream server.
- external-dns creates DNS records in external services. Those records take time to be created and to propagate.
In deploying dozens of services using CI/CD services with random subdomains, I've never seen kube-lego win this race. To succeed, kube-lego would effectively have to come second: it would have to issue its DNS query after external-dns has updated the external provider, after any negative caching has expired, and after the upstream provider has updated its nameservers (possibly globally).
The result for me has been that kube-lego can take north of ten minutes to deploy a certificate to a domain. Sometimes, due to the repeated failed validations, the certificates won't deploy for half an hour or more.
Solutions
These are possible implementation ideas for a solution, and I'd like to hear feedback on what's tenable, and what other folks would want to see implemented.
- The simplest solution: add a --delay flag to kube-lego that takes a time parameter, in seconds, to wait before it attempts to obtain certificates. Even 60 seconds might be sufficient.
- A hard (harder?) solution: have external-dns annotate or otherwise modify a service/ingress to indicate that its name will successfully resolve. This is, however, messy.
- The best solution: kube-lego should add an annotation or otherwise modify a service/ingress to allow external-dns to add the necessary _acme-challenge TXT record. There's still a race condition here, but in my testing, TXT records update a lot more quickly and DNS validation will almost always succeed on the second attempt. This is better than negative caching resulting in 10+ validation failures, which can lead to throttling from the Let's Encrypt servers.
The first option seems like the most attractive one for me to implement, because I don't know much Go (but I don't think I'll have any trouble picking it up) and I don't know much about signaling/annotations/API stuff in Kubernetes. I'd like to learn that though. Are there any good resources?
I'd also like to know what the authors think about implementing the third option. In my own testing against Microsoft's Azure DNS server and against 8.8.8.8, the TXT record was visible within a few seconds of creating it, and subsequent changes to the TXT record were visible within a few seconds. This was much better than the negative caching time of a minute or more for A/AAAA/CNAME records.
What do the maintainers think? Is this realistic?
(Regarding the third option, I'm tagging a few folks I saw making commits to external-dns: @linki @hjacobs... I'm not sure who else to tag.)