Renewal op retries when an expired certificate cannot be renewed
ankon opened this issue · 12 comments
What version of the package are you using?
github.com/caddyserver/certmagic v0.19.1
What are you trying to do?
We're using certmagic to manage certificates for a number of hostnames we host websites for. Unfortunately we do not control DNS for these websites, and it can happen that users accidentally break parts of the setup that prevent certificate renewals.
What steps did you take?
This is somewhat theoretical, but based on our current understanding of what we're seeing:
- Have an expired certificate in storage (and possibly cached?)
- Prevent ACME challenges from succeeding (i.e. no HTTP-01, no TLS-ALPN-01, no DNS-01 should work)
- Request the certificate (in our case this happens through the GetCertificate function as part of the handshake)
What did you expect to happen, and what actually happened instead?
I would expect an error: The certificate is expired, and shouldn't be served, nor is it in its "renewal window".
I would also expect that the error will take a while to be returned, because presumably we would see a request to obtain a new certificate.
What actually seems to happen is that certmagic perpetually tries to renew the certificate, times out (expected), and then retries renewal.
From our monitoring system we captured a stack trace, with the operation at that point having run for almost 14 hours.
How do you think this should be fixed?
A couple of things I found interesting and possibly relevant here:
- An error from handshakeMaintenance is logged but ignored in loadCertFromStorage. This might be ok and intentional, but I wonder whether it should check the type of error ("is this ok to defer?" vs. "does this block any usage of the certificate?"): Lines 405 to 410 in 51b3190
- currentlyInRenewalWindow returns true even when the certificate is expired: Lines 88 to 103 in 51b3190. Maybe it should return false in that case, and the caller should then explicitly do an obtain-the-certificate (a rough sketch follows below).
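For illustration only, here is a minimal sketch of the distinction that bullet suggests: treating "expired" and "merely in the renewal window" as separate states, so a caller could choose a blocking obtain for the former. This is not CertMagic's API; CertMagic computes its window from a ratio of the certificate lifetime, while this sketch uses a plain duration threshold, and all names here are made up.

```go
package certcheck

import (
	"crypto/x509"
	"time"
)

type certState int

const (
	certOK certState = iota
	certInRenewalWindow
	certExpired
)

// classify reports whether a leaf certificate is expired, inside a
// renew-before window, or still fine. renewBefore is a simple duration
// threshold standing in for CertMagic's ratio-based renewal window.
func classify(leaf *x509.Certificate, renewBefore time.Duration, now time.Time) certState {
	if now.After(leaf.NotAfter) {
		return certExpired
	}
	if leaf.NotAfter.Sub(now) < renewBefore {
		return certInRenewalWindow
	}
	return certOK
}
```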
A bit as an aside: the timeouts involved in renewing ("2 minutes", "90s" when expired, "5min" when there is time left) look a bit arbitrary; should these be made configurable?
Please link to any related issues, pull requests, and/or discussion
Bonus: What do you use CertMagic for, and do you find it useful?
We absolutely do find it useful! :)
We absolutely do find it useful! :)
Thank you! Glad you find it useful.
Unfortunately we do not control DNS for these websites
Make sure you're using on-demand TLS then (which it looks like you are). That's best for when you don't control the domains.
What actually seems to happen is that certmagic perpetually tries to renew the certificate, times out (expected), and then retries renewal.
I believe this is working as intended, actually!
I would expect an error: The certificate is expired, and shouldn't be served, nor is it in its "renewal window".
An expired certificate is considered "in the renewal window." A certificate that is still configured to be served will try to be renewed, ideally before it expires, but after if it must.
What this loop you're seeing is doing is trying to renew a certificate. When it sees that another goroutine (i.e. another TLS handshake) is already renewing it, it waits until that renewal is done so it can use the cert it gets. That's the case <-wait: line you can see in renewDynamicCertificate(). Of course, we have no idea if it succeeded, so we proceed to load the certificate from storage (which will be a no-op if it was already loaded by the other goroutine) and check its expiry. If it's still within the renewal window, then we know the other goroutine failed to renew it. Since we still need the cert renewed, we try again.
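As a rough illustration of the coordination described above (not the actual renewDynamicCertificate code), the waiting side boils down to a select over the other goroutine's completion channel and a timer; the names and the bare channel are assumptions made for the sketch.

```go
package handshakewait

import (
	"errors"
	"time"
)

// waitForOtherRenewal blocks until the goroutine already renewing the
// certificate signals completion (by closing wait), or until the timeout
// fires. On success the caller reloads the cert from storage and re-checks
// whether it is still in the renewal window.
func waitForOtherRenewal(wait <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-wait:
		// The other goroutine finished; whether it succeeded is unknown here.
		return nil
	case <-timer.C:
		return errors.New("timed out waiting for in-flight certificate renewal")
	}
}
```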
It does limit the retries to 2 minutes. It is arbitrary, yes, but do you have a suggestion for a better timeout? I have been trying to think of a better limit that makes more sense, but haven't come up with one. Hence the TODO in the code:
// TODO: see if we can get a proper context in here, for true cancellation
timeout := time.NewTimer(2 * time.Minute)
I'll happily use more sensible timeouts if we can figure those out.
And while I agree that expired certs shouldn't be served, ideally, I've found it's still better to serve the certificate as-is to not cover up what is actually happening. It just makes troubleshooting easier when it's obvious: "Oh, the cert is expired." Then you don't have to go digging for other problems, wondering why the server isn't serving any certificate.
Anyway, what you're seeing isn't (shouldn't be) an infinite loop because we do have a hard deadline set for these operations. Let me know if you confirm that you observe otherwise.
have a hard deadline set for these operations
Well ... my handshake was taking ~14h, before I restarted the server :)
So, yeah, I see the hard deadlines, but what we also see is that they do not lead to certmagic actually aborting. "Something" here needs to stop the chain of calling loadCertFromStorage -> handshakeMaintenance -> loadCertFromStorage.
We do use a custom storage though, so thinking aloud: what if I were to check expiration there, and refuse to return a cert that has already expired?
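To make the thinking-aloud concrete, a decorator over the Storage interface could look roughly like this. It assumes the context-aware Load signature of recent CertMagic versions and the convention that certificate resources end in .crt; whether refusing such loads is actually a good idea is exactly the open question here.

```go
package storagecheck

import (
	"context"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"strings"
	"time"

	"github.com/caddyserver/certmagic"
)

// expiryCheckingStorage wraps any certmagic.Storage and refuses to return a
// certificate PEM whose leaf has already expired.
type expiryCheckingStorage struct {
	certmagic.Storage
}

func (s expiryCheckingStorage) Load(ctx context.Context, key string) ([]byte, error) {
	data, err := s.Storage.Load(ctx, key)
	if err != nil || !strings.HasSuffix(key, ".crt") {
		return data, err
	}
	block, _ := pem.Decode(data)
	if block == nil {
		return data, nil // not PEM; pass through unchanged
	}
	leaf, parseErr := x509.ParseCertificate(block.Bytes)
	if parseErr != nil {
		return data, nil // unparsable; let the caller deal with it
	}
	if time.Now().After(leaf.NotAfter) {
		return nil, fmt.Errorf("refusing to load expired certificate %s (NotAfter %s)", key, leaf.NotAfter)
	}
	return data, nil
}
```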
Looking at it the other way around: if I read your answer correctly, then the assumption is that in the case of an expired certificate we would either do a blocking refresh (taking up to ~2 minutes), or would wait for another thread to do a blocking refresh (presumably also taking up to ~2 minutes), and then we should return an error from getCertDuringHandshake?
Ah, I think you might need a timeout on the connection itself (configured in the server).
Perhaps we should have a recursion deadline too; but at the same time, if a client is waiting for a handshake, why give up? AFAIK it doesn't hurt to retry to serve the client.
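For reference, the connection-level timeouts being referred to are the standard net/http server fields; as the comments further down note, these bound reads and writes on the connection but do not interrupt time spent inside GetCertificate itself. The durations below are placeholders, and TLSConfig() is CertMagic's helper for building a tls.Config from a Config.

```go
package servertimeouts

import (
	"net/http"
	"time"

	"github.com/caddyserver/certmagic"
)

// newServer wires a certmagic-managed TLS config into an http.Server with
// explicit connection deadlines.
func newServer(magic *certmagic.Config, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              ":443",
		Handler:           handler,
		TLSConfig:         magic.TLSConfig(),
		ReadHeaderTimeout: 10 * time.Second, // bounds reading the request headers
		ReadTimeout:       30 * time.Second,
		WriteTimeout:      30 * time.Second,
		IdleTimeout:       2 * time.Minute,
	}
}
```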
Looking at it the other way around: if I read your answer correctly, then the assumption is that in the case of an expired certificate we would either do a blocking refresh (taking up to ~2 minutes), or would wait for another thread to do a blocking refresh (presumably also taking up to ~2 minutes), and then we should return an error from getCertDuringHandshake?
Correct -- worth noting that ONLY expired certificates have blocking maintenance. Certs that haven't yet expired are renewed in the background and don't block handshakes, even though the handshake is what triggers the renewal.
Are you using the DecisionFunc to allow only recognized domain names? If you no longer serve a customer, for example, you should not allow the server to continue managing a cert for those names. (Unless scripts/bots are programmed with their own resolvers, domains that no longer point to your server should not be making handshakes that often.)
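A typical DecisionFunc setup looks roughly like the sketch below, assuming the v0.19-era signature (func(name string) error); allowedDomains and the lookup are placeholders for whatever backs the real allow-list.

```go
package ondemand

import (
	"fmt"

	"github.com/caddyserver/certmagic"
)

// configureOnDemand restricts on-demand certificate management to names that
// are still being served.
func configureOnDemand(magic *certmagic.Config, allowedDomains map[string]bool) {
	magic.OnDemand = &certmagic.OnDemandConfig{
		DecisionFunc: func(name string) error {
			if !allowedDomains[name] {
				return fmt.Errorf("domain %q is not (or no longer) served here", name)
			}
			return nil
		},
	}
}
```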
Are you using the DecisionFunc to allow only recognized domain names?
Yes we do. The problem is that in this case we should, from our side, serve the page; it just so happens that the customer has put a proxy in the middle that "eats" the challenge requests. This isn't painful per se (the requests trigger some storage lookups, but those happen with long delays between them!), but it disturbs monitoring information :)
Ah, I think you might need a timeout on the connection itself (configured in the server).
Actually that's a good point. I need to verify that we actually do have one in place for the handshake!
Well ... my handshake was taking ~14h, before I restarted the server :)
Huh... that shouldn't be happening. We set timeouts whenever we start a renewal operation or wait for a renewal operation to complete. I wonder where that handshake is hanging. Can you perhaps isolate its stack trace? The localhost:2019/debug/pprof endpoint has a link to get a goroutine dump, which can tell us where the code is at. This may be easier with fewer requests to the problematic domain, ideally just the one that hangs for hours. Then I can know better where we may be missing a timeout.
Note that this isn't Caddy, it's a Go server using certmagic -- we're still busy moving towards Caddy.
Here's the trace from the screenshot:
/src/database/dynamodb/certmagic.go (dynamodb.(*certmagicStorage).Load:157) [this is our storage loading things]
/src/vendor/github.com/caddyserver/certmagic/crypto.go (certmagic.(*Config).loadCertResource:253)
/src/vendor/github.com/caddyserver/certmagic/crypto.go (certmagic.(*Config).loadCertResourceAnyIssuer:182)
/src/vendor/github.com/caddyserver/certmagic/certificates.go (certmagic.(*Config).loadManagedCertificate:146)
/src/vendor/github.com/caddyserver/certmagic/certificates.go (certmagic.(*Config).CacheManagedCertificate:133)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:390)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
/src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
I need to verify that we actually do have one in place for the handshake!
Ok, did some stepping and further reading here.
From what I can see: the context passed to GetCertificate (via https://pkg.go.dev/crypto/tls#ClientHelloInfo.Context) does not have a timeout, but the connection internally does have a deadline set on both reading and writing (https://cs.opensource.google/go/go/+/refs/tags/go1.20.7:src/net/http/server.go;l=1871-1876, in our case because we have a read timeout set).
Now from my understanding that's "nice", but it doesn't help for GetCertificate, as that never reads from or writes to the connection.
We do not explicitly add any further timeout into the context we pass to certmagic's GetCertificateWithContext.
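One way to add such a timeout on the caller's side, sketched under the assumption that GetCertificateWithContext takes (ctx, clientHello) and that TLSConfig() returns a usable base config: wrap the getter so every handshake's certificate lookup runs under its own deadline, independent of the server's read/write deadlines.

```go
package handshaketimeout

import (
	"context"
	"crypto/tls"
	"time"

	"github.com/caddyserver/certmagic"
)

// tlsConfigWithHandshakeTimeout bounds every certificate lookup with its own
// context deadline, since hello.Context() carries none by default.
func tlsConfigWithHandshakeTimeout(magic *certmagic.Config, timeout time.Duration) *tls.Config {
	cfg := magic.TLSConfig()
	cfg.GetCertificate = func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		ctx, cancel := context.WithTimeout(hello.Context(), timeout)
		defer cancel()
		return magic.GetCertificateWithContext(ctx, hello)
	}
	return cfg
}
```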
I think what might be happening here requires additional requests: I can see in our logs that the attempts to renew the certificate fail on the first or second try ("will retry" messages with retrying_in of 60/120, attempts 1), with an error from Let's Encrypt (rate limited):
I also see the expected errors from the requesting side ("renewing and reloading certificate"), with varying reasons (context canceled directly or while our storage is doing I/O, ACME errors).
This will fail the renewal for the caller, so handshakeMaintenance returns (something), err, and I do see the errors from its caller loadCertFromStorage ("maintaining newly-loaded certificate"). If this call to loadCertFromStorage came from getCertDuringHandshake, that will now proceed with whatever it has (an expired certificate, if I read correctly). The handshake will likely fail, and the HTTPS client will be unhappy and tell the user. That's fine.
BUT: if other requests came in while we were attempting to renew, they would now be queued up and waiting on the 2-minute timer. The moment the initial handshakeMaintenance receives the error it will call unblockWaiters, and all of these requests will now call loadCertFromStorage. This will fetch the expired certificate again and attempt the maintenance, again with one request doing it (and failing) while the others recurse and try again.
Am I reading this correctly?
D'oh, sorry -- sometimes I forget which repo I'm in and that CertMagic != Caddy.
Let me read through this flow and see if I come to the same conclusion. What you're saying sounds like it is probably right, or at least very close, so let me just refresh my memory and I'll get back to this.
Ok, sorry that took me a minute. Yes, I think your explanation is accurate:
- First goroutine claims a lock for that name and tries to get the cert.
- Other goroutines that need that cert will see that first goroutine is already working on it, and wait.
- Once the first goroutine has either succeeded or failed, it will unblock the waiters.
- They will try loading the cert from storage.
Hmm, I think that is incorrect though. It probably should be:
- Once the waiters are unblocked, they simply use the certificate already loaded into the cache by the first goroutine.
- If the certificate is not in the cache, then something went wrong. Either they should repeat or just return the same error.
I guess it comes down to a question of... if one goroutine fails, should the others wait and retry or just give up?
I'm leaning toward they should give up and let the clients retry again if desired. Not sure why an immediate retry and recursion would actually be helpful here. (If you can think of something, feel free to let me know.)
So: should I change the code so that the waiters just unblock and use the result of the first goroutine? And if there isn't one, they also return an error?
Intuitively: The waiters should return the result of what they are waiting for. In this case the result is either a certificate ready to use, or a reason why that wasn't possible to provide (the error).
That the communication to get these results happens via the cache as short-term storage is just an implementation detail. You could conceptually also use a "singleflight" (https://pkg.go.dev/golang.org/x/sync/singleflight), for instance.
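For the conceptual comparison, a singleflight-shaped version of "waiters get the first goroutine's result" could look like this; it is not how CertMagic is structured internally, just the coordination shape being described, with renewOrWait and renew as made-up names.

```go
package renewflight

import (
	"crypto/tls"

	"golang.org/x/sync/singleflight"
)

var renewGroup singleflight.Group

// renewOrWait collapses concurrent renewals of the same name into a single
// call; every waiter receives the first goroutine's (*tls.Certificate, error)
// result instead of recursing into another load-and-renew round.
func renewOrWait(name string, renew func() (*tls.Certificate, error)) (*tls.Certificate, error) {
	v, err, _ := renewGroup.Do(name, func() (interface{}, error) {
		return renew()
	})
	if err != nil {
		return nil, err
	}
	return v.(*tls.Certificate), nil
}
```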
Gotcha. Ok, I'll work that into #249: the waiter will either return a certificate or an error, not recurse.
I think this is fixed now based on my trials but we can reopen this if needed!