caddy-dns/cloudflare

Error: "solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil>"

polarathene opened this issue · 1 comments

First time trying out the DNS provider feature. Tried with Caddyfile as below with scoped API token passed in as ENV var.

  • Waited a while with nothing happening until an error occurred.
  • Quick glance through existing issues, doesn't appear to be an earlier issue about it.
  • While raising this issue, the third attempt was successful, taking about 10-30 minutes to be successful (reproduced multiple times).

Is this expected behaviour of the feature with Cloudflare, or is the time taken and earlier failures a possible problem?

Environment Details:

  • The apex domain (example.com, no subdomain) has a reverse proxy (Traefik) that runs on a different server handling several sub-domains and a CNAME wildcard for non-explicit subdomains.
  • The Caddy instance is a fresh 2.4 binary from the download page with the Cloudflare module added, run via a fresh instance of Fedora 34 on Vultr.
  • Unlike when running Caddy via Docker, Fedora additionally required running a firewall command to allow HTTP/HTTPS traffic. Prior to trying the Cloudflare DNS module, a different subdomain was already configured on Cloudflare to the Vultr Fedora VPS IP and regular LetsEncrypt which worked successfully.

Caddyfile:

{
  acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
  email admin@example.com
}

test.example.com {
  tls {
    dns cloudflare {env.CF_API_TOKEN}
  }

  root * /srv/site
  file_server
}

Caddy stdout log:

2021/05/22 06:18:10.447 INFO    serving initial configuration
2021/05/22 06:18:10.433 INFO    tls     cleaning storage unit   {"description": "FileStorage:/root/.local/share/caddy"}
2021/05/22 06:18:10.430 INFO    tls.cache.maintenance   started background certificate maintenance      {"cache": "0xc000260460"}
2021/05/22 06:18:10.449 INFO    tls.obtain      acquiring lock  {"identifier": "test.example.com"}
2021/05/22 06:18:10.451 INFO    tls.obtain      lock acquired   {"identifier": "test.example.com"}

2021/05/22 06:18:10.452 INFO    tls.issuance.acme       waiting on internal rate limiter        {"identifiers": ["test.example.com"]}
2021/05/22 06:18:10.452 INFO    tls.issuance.acme       done waiting on internal rate limiter   {"identifiers": ["test.example.com"]}
2021/05/22 06:18:10.454 INFO    tls     finished cleaning storage units
2021/05/22 06:18:12.092 INFO    tls.issuance.acme.acme_client   trying to solve challenge       {"identifier": "test.example.com", "challenge_type": "dns-01", "ca": "https://acme-staging-v02.api.letsencrypt.org/directory"}

2021/05/22 06:20:15.996 INFO    tls.issuance.acme       waiting on internal rate limiter        {"identifiers": ["test.example.com"]}
2021/05/22 06:20:15.997 INFO    tls.issuance.acme       done waiting on internal rate limiter   {"identifiers": ["test.example.com"]}
2021/05/22 06:20:17.104 INFO    tls.issuance.acme.acme_client   trying to solve challenge       {"identifier": "test.example.com", "challenge_type": "dns-01", "ca": "https://acme-staging-v02.api.letsencrypt.org/directory"}

2021/05/22 06:22:19.301 ERROR   tls.obtain      will retry      {"error": "[test.example.com] Obtain: [test.example.com] solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil> (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/19618631/59618995) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)", "attempt": 1, "retrying_in": 60, "elapsed": 248.850234978, "max_duration": 2592000}

2021/05/22 06:23:20.908 INFO    tls.issuance.acme.acme_client   trying to solve challenge       {"identifier": "test.example.com", "challenge_type": "dns-01", "ca": "https://acme-staging-v02.api.letsencrypt.org/directory"}
2021/05/22 06:25:24.293 INFO    tls.issuance.acme.acme_client   trying to solve challenge       {"identifier": "test.example.com", "challenge_type": "dns-01", "ca": "https://acme-staging-v02.api.letsencrypt.org/directory"}

2021/05/22 06:27:25.852 ERROR   tls.obtain      will retry      {"error": "[test.example.com] Obtain: [test.example.com] solving challenges: waiting for solver certmagic.solverWrapper to be ready: timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil> (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/19618631/59621711) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)", "attempt": 2, "retrying_in": 120, "elapsed": 555.400860953, "max_duration": 2592000}

2021/05/22 06:29:28.604 INFO    tls.issuance.acme.acme_client   trying to solve challenge       {"identifier": "test.example.com", "challenge_type": "dns-01", "ca": "https://acme-staging-v02.api.letsencrypt.org/directory"}
2021/05/22 06:29:31.836 INFO    tls.issuance.acme.acme_client   validations succeeded; finalizing order {"order": "https://acme-staging-v02.api.letsencrypt.org/acme/order/19618631/59623852"}
2021/05/22 06:29:32.512 INFO    tls.issuance.acme.acme_client   successfully downloaded available certificate chains    {"count": 1, "first_url": "https://acme-staging-v02.api.letsencrypt.org/acme/cert/fa59c8fe71cfcb679672d9cb9c6fda649b27"}
2021/05/22 06:29:32.514 INFO    tls.obtain      certificate obtained successfully       {"identifier": "test.example.com"}
2021/05/22 06:29:32.514 INFO    tls.obtain      releasing lock  {"identifier": "test.example.com"}

TL;DR

Is usage of this module expected to fail multiple times in a valid setup and take 10 minutes to be successful?

If not, could the cause be related to :

  • Server setup such as Fedora default security policies interfering?
  • Existing DNS configuration? (multiple servers for a given domain via A records on specific subdomains, and a * CNAME record to the A record of the apex domain example.com)

Changed wildcard CNAME record to point to an A record for the server running caddy, no improvement. Applies to both staging and production ACME configs. Behaviour is reproducible.

timed out waiting for record to fully propagate; verify DNS provider configuration is correct - last error: <nil>
  • I looked at the Cloudflare dashboard and can see the _acme-challenge.test TXT record, this value changes each time after an error retry (60 seconds).
  • I have queried the challenge TXT record via https://dnschecker.org/#TXT/ which shows success on all checks around the world while Caddy fails to query the record successfully.
  • On my local system querying the DNS record also works.

Investigated the DNS query from within the Vultr VPS:

dig:

dig TXT _acme-challenge.test.example.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> TXT _acme-challenge.test.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31271
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;_acme-challenge.test.example.com. IN        TXT

;; ANSWER SECTION:
_acme-challenge.test.example.com. 269 IN CNAME vultr-vps.example.com.

;; AUTHORITY SECTION:
example.com.      269     IN      SOA     mia.ns.cloudflare.com. dns.cloudflare.com. 2037333583 10000 2400 604800 3600
example.com.      1436    IN      SOA     mia.ns.cloudflare.com. dns.cloudflare.com. 2037333381 10000 2400 604800 3600

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sat May 22 07:44:29 UTC 2021
;; MSG SIZE  rcvd: 174

dog:

$ dog _acme-challenge.test.example.com TXT
CNAME _acme-challenge.test.example.com. 3m39s   "vultr-vps.example.com."
  SOA example.com.                   3m39s A "mia.ns.cloudflare.com." "dns.cloudflare.com." 2037333631 2h46m40s 40m00s 7d0h00m00s 1h00m00s

$ dog _acme-challenge.test.example.com TXT @1.1.1.1
TXT _acme-challenge.test.example.com. 5m00s   "sPxh-D3aWfGismO-Y4w9iRzhDyb5btdPH126rDYczNM"

systemd resolvectl: (appears to be what the VPS was using for managing DNS)

$ cat /etc/resolv.conf | grep nameserver
nameserver 127.0.0.53

$ resolvectl dns
Global:
Link 2 (ens3): 108.61.10.10
Link 3 (docker0):

$ resolvectl --cache no -t TXT query _acme-challenge.test.example.com
_acme-challenge.test.example.com: resolve call failed: Name 'vultr-vps.example.com' does not have any RR of the requested type

$ resolvectl flush-caches

$ resolvectl --cache no -t TXT query _acme-challenge.test.example.com
_acme-challenge.test.example.com IN TXT "iUXEbzj8ZFKRMy-lsPIm-BzKk2NstsykMzVp6G_xTDo" -- link: ens3

-- Information acquired via protocol DNS in 1.6ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: network
  • Querying with a different DNS resolver than whatever Vultr was presumably providing on the ens3 ethernet connection seemed to work well.
  • resolvectl flush-caches didn't seem to have an effect.
  • resolvectl --cache no sometimes failed providing a CNAME record (the wildcard CNAME record on Cloudflare pointed an A record vultr-vps.example.com that was reported as a CNAME) instead of the TXT record that was present/available.
  • When resolvectl was successful, Caddy would follow shortly after in challenge success, this was always the case.
  • Configuring Caddyfile to use a different DNS resolver always worked and is the preferred solution.

Solution

If having DNS issues, and examples above for querying the DNS TXT record for _acme-challenge (which is present on your Cloudflare DNS dashboard) are not working without a custom resolver (like with dog query); then include resolvers 1.1.1.1 to use Cloudflare DNS resolver in the tls directive:

Caddyfile:

{
  acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
  email admin@example.com
}

test.example.com {
  tls {
    dns cloudflare {env.CF_API_TOKEN}
    resolvers 1.1.1.1
  }

  root * /srv/site
  file_server
}