Option to defer SIGTERM by a configurable amount of time
jtackaberry opened this issue · 6 comments
Kubernetes has a quirk where pods behind ingress proxies that terminate quickly during a rolling restart can cause HTTP 5xx errors. This is due to a race between the kubelet terminating old pods and the ingress noticing that the endpoints have changed, during which time traffic is forwarded to dead or dying pods. I've described this in a bit more detail over at static-web-server/static-web-server#79.
I'm facing this same issue with another service (this time gunicorn), and it occurs to me that rather than trying to solve it in each individual application, a simple init wrapper that defers SIGTERM would let me kill many birds with one stone. I naturally immediately thought of dumb-init.
I've implemented this fairly trivially in dumb-init, but the catch is that I'm using alarm(), which means that while the feature is enabled and the deferred-SIGTERM timer is running, SIGALRM effectively takes on SIGTERM's semantics. I see this as a small price to pay.
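To make the idea concrete, here's a rough sketch of the approach (not the actual patch: the 10-second delay constant is made up, error handling is simplified, and real dumb-init forwards to the whole process group and remaps signals):

```c
/* Sketch only: first SIGTERM arms alarm() instead of being forwarded;
 * the resulting SIGALRM is what actually delivers SIGTERM to the child. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define SIGTERM_DELAY_SECONDS 10  /* hypothetical; would be configurable */

static pid_t child_pid = -1;

static void handle_signal(int signum) {
    if (signum == SIGTERM) {
        /* Defer: arm the timer instead of forwarding right away. */
        alarm(SIGTERM_DELAY_SECONDS);
    } else if (signum == SIGALRM) {
        /* Timer expired: this is where SIGALRM "becomes" SIGTERM. */
        if (child_pid > 0)
            kill(child_pid, SIGTERM);
    }
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    signal(SIGTERM, handle_signal);
    signal(SIGALRM, handle_signal);

    child_pid = fork();
    if (child_pid < 0) {
        perror("fork");
        return 1;
    }
    if (child_pid == 0) {
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }

    /* Wait for the child, retrying if interrupted by our handlers. */
    int status = 0;
    while (waitpid(child_pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return 1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```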
I'm wondering whether the maintainers of dumb-init would be receptive to a pull request for this feature given the above limitation, or whether you consider it too far outside dumb-init's purview, in which case it would be better for me to maintain a fork.
Thanks!
adding features to dumb-init kinda goes against the entire idea -- it should be pretty easy to write a little wrapper which does what you want and which you can chain together: dumb-init delays-sigterm yourexe ...
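a rough, untested sketch of what that wrapper might look like (the delays-sigterm name and the hardcoded 10-second delay are just placeholders, not anything that exists):

```c
/* Hypothetical standalone wrapper meant to sit between dumb-init and the
 * real command, e.g.: dumb-init delays-sigterm yourexe ...
 * It holds SIGTERM for a bit before passing it on to the child. */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define DELAY_SECONDS 10  /* placeholder; would come from an env var or flag */

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    /* Block the signals we want to handle synchronously; the mask is
     * inherited across fork, so the child must restore it before exec. */
    sigset_t mask, orig;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGCHLD);
    sigprocmask(SIG_BLOCK, &mask, &orig);

    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        sigprocmask(SIG_SETMASK, &orig, NULL);  /* undo the block for the child */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }

    for (;;) {
        int sig;
        if (sigwait(&mask, &sig) != 0)
            continue;
        if (sig == SIGTERM) {
            /* Hold the SIGTERM so the ingress can catch up, then forward it. */
            sleep(DELAY_SECONDS);
            kill(child, SIGTERM);
        } else if (sig == SIGCHLD) {
            int status;
            if (waitpid(child, &status, WNOHANG) == child)
                return WIFEXITED(status) ? WEXITSTATUS(status)
                                         : 128 + WTERMSIG(status);
        }
    }
}
```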
though it sounds like you're very far down an XY-problem rabbit hole and you've misconfigured k8s -- you should take your pods out of your ingress before cycling them
You mean I should build some orchestration on top of an orchestration system :)
I personally see this as a failing of Kubernetes. I'm not even sure what mechanism exists to explicitly withdraw pods from the endpoints slice (which is what ingress controllers use to find targets), especially when you consider that a rolling restart is done automatically when you modify a resource (such as bumping an image version).
I figured that since signal remapping is about as non-dumb a feature as this one, it might have a shot, considering this is a commonly encountered problem where the prevailing wisdom has thus far been to bake SIGTERM-moratorium logic into the container.
I'll explore the wrapper option, though given that I'd have to implement most of what dumb-init already does (minus perhaps the remapping), I suspect a fork will probably end up being the path of least resistance.
Thanks for offering feedback so quickly!
yeah I don't have this problem with k8s at all -- so I suspect it's something you're doing wrong there and I'd recommend you look into that first before building a thing that nobody needs :P
I think you do have this problem, you just haven't gotten caught up in the race. If you're dealing with slow-to-terminate applications, or you aren't beating the hell out of the service during a rolling restart, it's easy for it to go undetected. For lightweight pods that terminate quickly and start quickly, the window can be under a second. Even still, for a busy service, this means surfacing 5xx to users.
This condition is discussed at kubernetes/ingress-gce#769, where it was acknowledged as expected behavior that the underlying containers are supposed to handle. (I disagree with that position, but there's not much I can do about it.) This post also goes into a lot more detail: https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-11-avoiding-downtime-in-rolling-deployments-by-blocking-sigterm/.
pretty sure I've got a k8s / envoy setting enabled which handles this class of problem -- but I'm not at that computer 🤷
I'd be interested to know, if you're able to check next time it's convenient. I don't understand how the ingress controller can have any influence over this. The rolling restart is orchestrated independently: new pods spin up and pass their readiness probes, at which point the containers in the old pods receive SIGTERM. If they terminate before the ingress controller is (asynchronously) notified of the change in endpoints, I don't see how it can avoid trying to forward to dead pods. Possibly there's some retry logic that masks this?