tazjin/kubernetes-letsencrypt

Error creating new authz :: too many currently pending authorizations

Opened this issue · 5 comments

drigz commented

Using kubernetes-letsencrypt v1.7 with Cloud DNS and GKE, we've observed a "too many currently pending authorizations" error. This is surprising, since the limit is 300 pending authorizations, but we only have ~10 certificates on the domain. kubernetes-letsencrypt was previously working fine, but when a new team member tried to bring up their own cluster, they ran into this issue.

On the Let's Encrypt forums, schoen said:

So I think the likeliest interpretation is [...] it sometimes requests an authorization and then does not use it (either requesting an authorization when not requesting a certificate, or requesting an authorization and then crashing or exiting before the corresponding certificate can be requested). This could, for example, be a renewal-related bug if one part of the code says "this certificate should be renewed now" but another part of the code says "this certificate is not yet due for renewal".

and

Maybe this does lead to some useful guidance for client developers: if you get an authz for one requested domain but fail to get it for another, make sure you proactively destroy the first authz before giving up. (If your error was based on repeated failed attempts to get a certificate for a mixture of names you do and don't control, that might be the underlying problem here.)

Is that possible? If we see it again, what can we do to get more debug information?

org.shredzone.acme4j.exception.AcmeRateLimitExceededException: Error creating new authz :: too many currently pending authorizations
        at org.shredzone.acme4j.connector.DefaultConnection.createAcmeException(DefaultConnection.java:394)
        at org.shredzone.acme4j.connector.DefaultConnection.accept(DefaultConnection.java:199)
        at org.shredzone.acme4j.Registration.authorizeDomain(Registration.java:189)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.getAuthorization(CertificateRequestHandler.kt:90)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.authorizeDomain(CertificateRequestHandler.kt:68)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.access$authorizeDomain(CertificateRequestHandler.kt:27)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:41)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:27)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.Collections$2.tryAdvance(Collections.java:4717)
        at java.util.Collections$2.forEachRemaining(Collections.java:4725)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
        at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.requestCertificate(CertificateRequestHandler.kt:41)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.handleCertificateRequest(ServiceManager.kt:64)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.access$handleCertificateRequest(ServiceManager.kt:20)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager$reconcileService$1.run(ServiceManager.kt:45)
        at java.lang.Thread.run(Thread.java:745)
drigz commented

I've looked in the kubernetes-letsencrypt logs and noticed two things.

One: the CloudDnsResponder threw an exception early on:

Exception in thread "Thread-2" java.lang.UnsupportedOperationException: Empty collection can't be reduced.
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.findMatchingZone(CloudDnsResponder.kt:123)
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.updateCloudDnsRecord(CloudDnsResponder.kt:55)
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.addChallengeRecord(CloudDnsResponder.kt:26)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.prepareDnsChallenge(CertificateRequestHandler.kt:176)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.authorizeDomain(CertificateRequestHandler.kt:77)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.access$authorizeDomain(CertificateRequestHandler.kt:27)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:41)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:27)
    [SNIP: java.util.stream.*]
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.requestCertificate(CertificateRequestHandler.kt:41)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.handleCertificateRequest(ServiceManager.kt:64)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.access$handleCertificateRequest(ServiceManager.kt:20)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager$reconcileService$1.run(ServiceManager.kt:45)
    at java.lang.Thread.run(Thread.java:745)

This appears to be because our Cloud DNS configuration had the wrong zone, so the responder didn't work.
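For reference, here is a hypothetical reconstruction (not the project's actual code) of how findMatchingZone can fail in exactly this way: Kotlin's reduce() throws UnsupportedOperationException("Empty collection can't be reduced.") when it runs over an empty collection, which happens here when no managed zone matches the challenge record. A guarded variant could surface a clearer configuration error instead:

    // Hypothetical sketch: Zone and both functions below are illustrative stand-ins,
    // not the actual CloudDnsResponder code.
    data class Zone(val name: String, val dnsName: String)

    // Picks the longest matching zone; reduce() throws
    // "Empty collection can't be reduced." if the filter leaves nothing.
    fun findMatchingZone(zones: List<Zone>, record: String): Zone =
        zones
            .filter { record.endsWith(it.dnsName) }
            .reduce { a, b -> if (a.dnsName.length > b.dnsName.length) a else b }

    // Guarded variant: fail with an explicit configuration error instead.
    fun findMatchingZoneSafe(zones: List<Zone>, record: String): Zone =
        zones
            .filter { record.endsWith(it.dnsName) }
            .maxByOrNull { it.dnsName.length }
            ?: throw IllegalStateException("No Cloud DNS zone matches $record - check the zone configuration")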

Two: this error occurs 300 times before the rate limit error takes its place. This takes about an hour because the operation is retried very frequently. The retries continue, leading to rate limit errors every 45 seconds or so.

Two things that could help here (a rough sketch follows the list):

  • The authz should be deleted if the CloudDnsResponder crashes, to avoid hitting the "pending authorizations" limit.
  • Exponential backoff should be used in case of failures.
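A minimal Kotlin sketch of what both suggestions could look like together. Everything here is hypothetical: the function name and the prepareDnsChallenge/completeChallenge lambdas stand in for the existing handler logic, and Authorization.deactivate() should be checked against the acme4j version the project actually uses:

    import org.shredzone.acme4j.Authorization
    import org.shredzone.acme4j.Registration

    // Hypothetical sketch, not the project's implementation.
    fun authorizeWithCleanup(
        registration: Registration,
        domain: String,
        prepareDnsChallenge: (Authorization) -> Unit,   // e.g. the CloudDnsResponder step
        completeChallenge: (Authorization) -> Unit,
        maxAttempts: Int = 5
    ) {
        var backoffMillis = 1_000L
        repeat(maxAttempts) { attempt ->
            val authorization = registration.authorizeDomain(domain)
            try {
                prepareDnsChallenge(authorization)      // the step that currently fails on a bad zone
                completeChallenge(authorization)
                return
            } catch (e: Exception) {
                // Drop the authz so it does not pile up against the 300 pending-authorization limit.
                runCatching { authorization.deactivate() }
                println("Attempt ${attempt + 1} for $domain failed: ${e.message}; backing off ${backoffMillis}ms")
                Thread.sleep(backoffMillis)
                backoffMillis = minOf(backoffMillis * 2, 15 * 60_000L)  // exponential backoff, capped at 15 min
            }
        }
        error("Giving up on $domain after $maxAttempts attempts")
    }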

tazjin commented

Thanks for reporting this, I'll look into handling this more gracefully!

drigz commented

Thanks! FYI, as a workaround, we deleted the letsencrypt-keypair secret. This makes kubernetes-letsencrypt register a new ACME account, which starts with an empty pending-authorization quota.

kubectl --namespace kube-system delete secret letsencrypt-keypair

drigz commented

Note: LE just enabled pending authorization recycling, which might help avoid this issue:

https://community.letsencrypt.org/t/automatic-recycling-of-pending-authorizations/41321

tazjin commented

Interesting! I started working on the issues you reported yesterday - but time is currently a scarce resource :-)