tazjin/kubernetes-letsencrypt

Detecting Google Cloud DNS zone

Closed this issue · 12 comments

ahume commented

Hi,

I think this is more likely something I don't quite follow with the auth/challenge flow, but here is the behaviour we're seeing in Google Cloud.

  • I create a service with acme/certificate: lb.flags0.gcp0.example.net.
  • The controller creates a TXT record at _acme-challenge.lb.flags0.gcp0.example.net..
  • Let's Encrypt fails to find the above record because it is querying for _acme-challenge.gcp0.example.net.

I've worked around this by copying the digest from the lb.flags0 TXT record into the zone at gcp0.example.net (which is a different GCP project), but I presume there is something going wrong in the flow here.
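For reference, in the ACME dns-01 challenge the TXT record lives at _acme-challenge.<domain>, and its value is the unpadded base64url encoding of the SHA-256 digest of the key authorization. The sketch below illustrates that rule only. Dns01Record, recordName and recordValue are hypothetical names, and this is not the controller's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper, not part of kubernetes-letsencrypt.
final class Dns01Record {

    /** Builds the fully qualified challenge record name for a domain. */
    static String recordName(String domain) {
        // Ensure exactly one trailing dot so the name is fully qualified.
        String fqdn = domain.endsWith(".") ? domain : domain + ".";
        return "_acme-challenge." + fqdn;
    }

    /** Computes the TXT value: base64url(SHA-256(keyAuthorization)), no padding. */
    static String recordValue(String keyAuthorization) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(keyAuthorization.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(recordName("lb.flags0.gcp0.example.net"));
        // "token.thumbprint" is a placeholder, not a real key authorization.
        System.out.println(recordValue("token.thumbprint"));
    }
}
```

The record name always carries the full requested domain, so a record at _acme-challenge.gcp0.example.net would not satisfy a challenge for lb.flags0.gcp0.example.net.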

Hey!

The controller will currently attempt to use the most specific matching zone, which I assume in your case is either flags0.gcp0.example.net or lb.flags0.gcp0.example.net.
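A minimal sketch of what "most specific matching zone" can mean in practice: pick the longest zone DNS name that is a suffix of the record name on a label boundary. ZoneMatcher and its methods are hypothetical names chosen for illustration, and the controller's real implementation may differ.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not the controller's actual zone detection code.
final class ZoneMatcher {

    /** Lower-cases a name and strips one trailing dot so names compare consistently. */
    static String normalize(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".") ? lower.substring(0, lower.length() - 1) : lower;
    }

    /** True if the record name equals the zone name or sits below it on a label boundary. */
    static boolean contains(String zone, String recordName) {
        String z = normalize(zone);
        String r = normalize(recordName);
        return r.equals(z) || r.endsWith("." + z);
    }

    /** Returns the matching zone with the longest name, i.e. the most specific one. */
    static Optional<String> mostSpecificZone(List<String> zoneDnsNames, String recordName) {
        return zoneDnsNames.stream()
                .filter((String zone) -> contains(zone, recordName))
                .max(Comparator.comparingInt((String zone) -> normalize(zone).length()));
    }

    public static void main(String[] args) {
        List<String> zones = List.of(
                "example.net.", "gcp0.example.net.", "flags0.gcp0.example.net.");
        System.out.println(mostSpecificZone(
                zones, "_acme-challenge.lb.flags0.gcp0.example.net."));
    }
}
```

With the three zones above, the record for lb.flags0.gcp0.example.net lands in flags0.gcp0.example.net. rather than in either parent zone.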

After making the change, it logs the zone it detected (log.info("Waiting for change in zone {} to finish. This may take some time.", result.zone());).

The authentication flow uses the specified subdomain, so the record that the controller creates is correct. I'm confused about why LE would query for the record at a different level; that does not sound correct. I'll do some tests to try and reproduce it.

Questions:

  1. Can you confirm that the detected zone is the zone that you expect it to use?
  2. Where did you see that Let's Encrypt is querying the record without the lb. prefix?

Another thought:

> I've worked around this by copying the digest from the lb.flags0 TXT record into the zone at gcp0.example.net (which is a different GCP project), but I presume there is something going wrong in the flow here.

Which TXT record did you create in the gcp0.example.net zone? Theoretically LE polling should fail if the actual zone is delegated correctly.

ahume commented
[Thread] INFO in.tazj.k8s.letsencrypt.acme.CloudDnsResponder - Waiting for change in zone flags0-gcp0-example-net to finish. This may take some time.
[Thread] INFO in.tazj.k8s.letsencrypt.util.DnsRecordObserver - Waiting for DNS record '_acme-challenge.lb.flags0.gcp0.example.net' update

The detected zone above (flags0-gcp0-example-net) is what I expected, yes. And that's where the _acme-challenge.lb... was created.

However, the challenge was failing:

[Thread] ERROR in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler - Challenge https://acme-staging.api.letsencrypt.org/acme/challenge/<example_path>/<example_id> failed
Exception in thread "Thread" in.tazj.k8s.letsencrypt.util.LetsencryptException: Failed due to invalid challenge

and the response from the LE endpoint (the URL in the error above) included:

"error": {
  "type": "urn:acme:error:connection",
  "detail": "DNS problem: SERVFAIL looking up CAA for gcp0.example.net",
  "status": 400
},

That led me to speculate that copying the newly created TXT record over to the gcp0.example.net. zone might make it work, which it did.

@ahume just FYI, the logged link contains the true domain name; if that's sensitive, you may want to remove it!

ahume commented

Yeah, thank you - I'll change it.

Do you actually use CAA records?

More questions, sorry :)

Just to make sure: The TXT record you copied over was for the FQDN of the challenge, including the lb.flags0.gcp0. bit?

I've tested this with multiple levels of delegation on a test domain and can't reproduce it (yet). The error message about an actual SERVFAIL (and for a different record type than the challenge!) is suspicious though.
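As background on why a CAA lookup for gcp0.example.net can appear while validating lb.flags0.gcp0.example.net: CAA checking (RFC 8659) starts at the requested name and walks up the DNS tree one label at a time, stopping at the first name that has a CAA record set. The sketch below only lists the names such a walk could visit. CaaTreeClimb and candidateNames are hypothetical names, and nothing in this thread confirms that this walk was the cause of the failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of CAA tree climbing; not Let's Encrypt's code.
final class CaaTreeClimb {

    /** Lists the names a CAA lookup may query, from the full name up to the TLD. */
    static List<String> candidateNames(String domain) {
        // Drop a trailing dot so "example.net." and "example.net" behave the same.
        String name = domain.endsWith(".")
                ? domain.substring(0, domain.length() - 1)
                : domain;
        List<String> names = new ArrayList<>();
        while (!name.isEmpty()) {
            names.add(name);
            int dot = name.indexOf('.');
            // Strip the leftmost label; stop once no labels remain.
            name = dot < 0 ? "" : name.substring(dot + 1);
        }
        return names;
    }

    public static void main(String[] args) {
        candidateNames("lb.flags0.gcp0.example.net").forEach(System.out::println);
    }
}
```

If no CAA records exist at lb.flags0.gcp0.example.net or flags0.gcp0.example.net, the third name queried would be gcp0.example.net, so a resolver error at that level could plausibly surface in the challenge response even though the TXT record itself sits lower in the tree.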

While the controller creates & validates the TXT record, can you try polling it with dig TXT @8.8.8.8 +trace _acme-challenge.lb.flags0.gcp0.example.net. to see if the response is correct there?

What mainly confuses me is how adding the record in a zone that is not authoritative for the requested subdomain could make it work.

ahume commented

Hugely appreciate your time on this. I'm going to check through our DNS stuff again tomorrow with someone who understands it better. Will get back with answers for you then. There's every chance I've messed up some of the configuration at this end.

n/p, I use this in production in multiple places so weeding out potential issues is important for me :)

Alright, let me know if you find anything tomorrow!

ahume commented

Retried all this with a fresh domain this morning and it's working. Our only conclusion is that there was something going on with DNS caching somewhere for that domain (as I'd been using it for some failed testing earlier in the week). My bad. Thanks again for all your quick responses, etc.

Glad to hear it solved itself. Cheers!