caddyserver/certmagic

Handling canceled contexts in `Unlock` in custom storage implementations

ankon opened this issue · 8 comments

ankon commented

What is your question?

This relates to this piece of code:

certmagic/config.go

Lines 504 to 517 in 51b3190

lockKey := cfg.lockKey(certIssueLockOp, name)
err = acquireLock(ctx, cfg.Storage, lockKey)
if err != nil {
	return fmt.Errorf("unable to acquire lock '%s': %v", lockKey, err)
}
defer func() {
	log.Info("releasing lock", zap.String("identifier", name))
	if err := releaseLock(ctx, cfg.Storage, lockKey); err != nil {
		log.Error("unable to unlock",
			zap.String("identifier", name),
			zap.String("lock_key", lockKey),
			zap.Error(err))
	}
}()

Basically: it is possible that locking succeeds but the context subsequently gets canceled for whatever reason. The deferred releaseLock call on line 511 will then call our custom storage implementation's Unlock with the now-canceled context. Right now that leads to an error in our code; eventually our lock expires and we can continue.

Is there anything we can do better here? For instance, would it be reasonable to pass another context into releaseLock/Unlock that is not canceled? Or should we "ignore" the cancelation on our side?

For reference: Our storage uses DynamoDB, and Lock and Unlock basically modify a specific item in it. We pass the context through to the AWS SDK DynamoDB call, which will recognize it is canceled and not execute the actual request.
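
For illustration, our Unlock is roughly the following sketch (the type and field names are made up, and the real code has more error handling):

package mystorage

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// DynamoDBStorage is a made-up name for our storage implementation.
type DynamoDBStorage struct {
	client    *dynamodb.Client
	lockTable string
}

// Unlock deletes the lock item. We pass the caller's context straight
// through to the AWS SDK, so a canceled context means the DeleteItem
// request is never actually sent.
func (s *DynamoDBStorage) Unlock(ctx context.Context, key string) error {
	_, err := s.client.DeleteItem(ctx, &dynamodb.DeleteItemInput{
		TableName: aws.String(s.lockTable),
		Key: map[string]types.AttributeValue{
			"LockKey": &types.AttributeValueMemberS{Value: key},
		},
	})
	return err // surfaces context.Canceled when ctx was already canceled
}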

ankon commented

In the meantime I found the underlying cause of the canceled context: the background renewal of a not-yet-expired certificate uses the same context as was given to GetCertificateWithContext, which in our case is the handshake context coming from https://pkg.go.dev/crypto/tls#ClientHelloInfo.Context.

See also #248: the background renewal should use a different context. IMHO it is wrong to assume that it can use the one given to GetCertificate, which is supposed to be called inside the TLS handshake and as such has only limited time to do things.

mholt commented

Is there anything we can do better here? For instance, would it be reasonable to pass another context into releaseLock/Unlock that is not canceled? Or should we "ignore" the cancelation on our side?

Good question! Hmm.

Part of me thinks: ignore the context, because the whole point is to clean up after the context is canceled. The other part of me thinks: if the context is canceled, it's important to abort to avoid a goroutine leak (or something).

Or maybe we SHOULD make a new context that isn't canceled but has a timeout, so that the unlock can still finish but its time is bounded. Is that reasonable? Or do you think the storage implementation should have control of the timeout and set its own when it encounters an already-canceled context?
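
Something like this, perhaps (a sketch of the first option against the snippet quoted above; the 10s value is a placeholder, and it needs a "time" import):

defer func() {
	// Hypothetical: detach from the canceled context but keep a bound
	// on how long the unlock may take.
	releaseCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := releaseLock(releaseCtx, cfg.Storage, lockKey); err != nil {
		log.Error("unable to unlock", zap.Error(err))
	}
}()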

In the meantime I found the underlying cause of the canceled context: the background renewal of a not-yet-expired certificate uses the same context as was given to GetCertificateWithContext, which in our case is the handshake context coming from https://pkg.go.dev/crypto/tls#ClientHelloInfo.Context.

See also #248: the background renewal should use a different context. IMHO it is wrong to assume that it can use the one given to GetCertificate, which is supposed to be called inside the TLS handshake and as such has only limited time to do things.

I think I agree. Hadn't really considered this, but a background routine indeed should not be confined to a single handshake.

But why haven't we encountered this problem before? 🤔 (rhetorical question, and mostly for me)

ankon commented

Or maybe we SHOULD make a new context that isn't canceled but has a timeout, so that the unlock can still finish but its time is bounded.

I think I have seen something like this in other Go code, but I cannot remember where, so I can't point to that pattern. In a way it sounds reasonable, and I don't think that I-as-storage should be allowed to take ages to clean up.

But then again, I couldn't come up with a good timeout here -- 1s? 10s? 10% of the available timeout?


Incidentally, before I forget this yet again:

ctx := context.TODO() // TODO: get a proper context? from somewhere...
the answer to the TODO is probably ctx := clientHello.Context(). I can file a PR if that makes sense, but there's the risk that another place now breaks because we missed something somewhere.

... and that might answer your rhetorical question: because GetCertificate actually passes a TODO context by default.
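
To make the TODO concrete, the change would look roughly like this (a sketch of the proposed fix, not the actual patch):

func (cfg *Config) GetCertificate(clientHello *tls.ClientHelloInfo) (*tls.Certificate, error) {
	// ClientHelloInfo.Context (Go 1.17+) is canceled when the handshake
	// finishes, which is exactly the lifetime this call should be bound to.
	return cfg.GetCertificateWithContext(clientHello.Context(), clientHello)
}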

Go 1.21 adds context.WithoutCancel(), which would be useful here. But we can't bump the minimum Go version for certmagic quite yet (not before Caddy does, and we usually wait until the next release so that two Go versions are supported at any given time).
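
For reference, once the minimum version allows it, the deferred release could detach from the handshake context like this (sketch only):

defer func() {
	// Go 1.21+: WithoutCancel keeps the parent's values but is never
	// canceled, so the unlock can proceed after the handshake context is done.
	if err := releaseLock(context.WithoutCancel(ctx), cfg.Storage, lockKey); err != nil {
		log.Error("unable to unlock", zap.Error(err))
	}
}()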

mholt commented

@ankon

the answer to the TODO is probably ctx := clientHello.Context(). I can file a PR if that makes sense, but there's the risk that another place now breaks because we missed something somewhere.

I somehow missed that Go 1.17 added that.

Scanning the code, I think that is a correct solution: the context should be limited to the handshake. If the client aborts the handshake, the server should abort whatever it's doing. UNLESS we start a background maintenance goroutine, in which case we should probably use a different context (done in #248).

mholt commented

As for the context used for locking...

I feel like the most correct thing to do is to pass a fresh context with our own timeout. Of course, the obvious question is: what should the duration be?

I don't know of course... but maybe like 10s?

I made a Twitter poll just for fun: https://twitter.com/mholt6/status/1689331972114690048 -- but admittedly it's a niche/obscure question.

What do you think, would that work for your deployment?

ankon commented

Quoting myself a bit:

I think passing a canceled context is definitely wrong.
I did find this interesting discussion: golang/go#54775. I'm almost leaning towards "don't pass a context to cleanup at all", or maybe: a new context without any timeout, unless a timeout actually has meaning for the operation.

I don't think there's any good timeout that certmagic would be able to define, and there's also no sane way to ask for one in the configuration ("unlock-timeout-when-using-a-weird-storage"?), so it's probably the storage side that needs to know what makes sense.

The referenced discussion reads like "just stop and abandon properly; the context is now canceled and there's nothing wrong with that" -- and in that case I-as-storage should actually pass a new context to my functions that do the actual I/O. In other words: if this were not a function but inlined code, I would still have to call my I/O function, and at that point at the latest I would have to pass a new context.

(Disclaimer: way out of my depth here too; let's see what the ask-the-audience lifeline says :D)

What do you think, would that work for your deployment?

In our specific case a 10s timeout should be plenty to basically do a DynamoDB DeleteItem call, so it would work.
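
Concretely, our workaround could look like this (same hypothetical names as in the earlier sketch; needs a "time" import):

func (s *DynamoDBStorage) Unlock(_ context.Context, key string) error {
	// Ignore the caller's (possibly canceled) context; clean up with a
	// fresh one, bounded by our own timeout.
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_, err := s.client.DeleteItem(cleanupCtx, &dynamodb.DeleteItemInput{
		TableName: aws.String(s.lockTable),
		Key: map[string]types.AttributeValue{
			"LockKey": &types.AttributeValueMemberS{Value: key},
		},
	})
	return err
}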

mholt commented

Ok, actually, I think your logic makes sense.

I'll change our code to pass in a fresh context that is not canceled and has no timeout. If we encounter problems with hanging unlocks, we can enforce a timeout. (Or the implementation just needs to be fixed; either way.)
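
In other words, the deferred release from the snippet at the top would become roughly (a sketch of the intended change, not the final patch):

defer func() {
	log.Info("releasing lock", zap.String("identifier", name))
	// Fresh context: never canceled, no deadline. If unlocks ever hang,
	// a timeout can be added here later.
	if err := releaseLock(context.Background(), cfg.Storage, lockKey); err != nil {
		log.Error("unable to unlock",
			zap.String("identifier", name),
			zap.String("lock_key", lockKey),
			zap.Error(err))
	}
}()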