istio/ztunnel

Implement improved draining

Opened this issue · 14 comments

Related:

There are a few shutdown sequences we need to handle:

  1. Pod is shutting down:

We already get the delworkload message when the pod is done, not when it starts exiting. We need to shut down immediately, not do a long drain.

Ideally, we also have a graceful shutdown. That is, when the pod is gone, clients know to close their connections. For TCP, this is no problem - when the connection is dropped they should get a RST (TODO: verify this happens, since the veth is ripped out from us!).
For HBONE, we want to send a GOAWAY and tls close_alert to gracefully shutdown. We cannot do this after the pod shuts down, since the CNI will remove the veth!

This is on a per-pod basis

  2. Ztunnel is shutting down
    In this case, we want to have a long drain period to allow new connections to gracefully connect to the new ztunnel, and let existing connections naturally die out.
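These two cases can be sketched as a small drain-policy helper. The type names and durations below are illustrative, not ztunnel's actual API:

```rust
use std::time::Duration;

/// Why the proxy is draining (illustrative names, not ztunnel's real types).
#[derive(Debug)]
enum DrainMode {
    /// The workload pod is gone: shut down its proxy immediately.
    PodShutdown,
    /// Ztunnel itself is terminating: drain slowly so clients can
    /// migrate to the replacement ztunnel and existing connections
    /// can naturally die out.
    ZtunnelShutdown,
}

/// Pick a grace period for the drain. The durations are placeholders.
fn drain_deadline(mode: DrainMode) -> Duration {
    match mode {
        DrainMode::PodShutdown => Duration::from_secs(0),
        DrainMode::ZtunnelShutdown => Duration::from_secs(30),
    }
}
```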

Options for graceful shutdown:

(1) Add a new ShutdownStarting message to ZDS. Upon this, send GOAWAYs to clients.

Mixed: draining happens immediately instead of at the end of the pod. This probably doesn't matter; this type of backpressure is more useful when the clients are not mesh-aware, and HBONE clients generally are. This does mean we close pooled connections a bit faster though, which is maybe a small win... unless we legitimately would use the pooled connections.
Cons: More complexity in ZDS
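A rough sketch of what option (1) could look like on the ztunnel side. The enum shape and the `ShutdownStarting` name are hypothetical; only the GOAWAY-on-shutdown-start behavior comes from the option above:

```rust
/// Hypothetical ZDS message set. AddWorkload/DelWorkload exist today;
/// ShutdownStarting is the proposed addition from option (1).
#[derive(Debug)]
enum ZdsMessage {
    AddWorkload(String),      // workload UID
    DelWorkload(String),      // workload UID
    ShutdownStarting(String), // proposed: pod has begun terminating
}

/// What the proxy does in response (illustrative).
#[derive(Debug, PartialEq)]
enum ProxyAction {
    StartProxying,
    StopImmediately,
    /// Tell HBONE peers to stop pooling connections to this workload.
    SendGoaways,
}

fn handle(msg: &ZdsMessage) -> ProxyAction {
    match msg {
        ZdsMessage::AddWorkload(_) => ProxyAction::StartProxying,
        ZdsMessage::DelWorkload(_) => ProxyAction::StopImmediately,
        ZdsMessage::ShutdownStarting(_) => ProxyAction::SendGoaways,
    }
}
```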

(2) Move DelWorkload hook to a CNI DEL cmd. This ensures we run after the application shuts down, but before the network is destroyed (I guess DEL runs in reverse order? and we are always the last plugin). We can then drain and ACK once we are done, then pod shutdown can complete. One nice thing is this shutdown happens out of band to the pod -- the pod is completely removed while this is taking place. So users won't see "pod is taking a while to shutdown and causing issues"; it's invisible. Obviously this shouldn't be slow, though.

Pro: "perfect" timing to do last-mile shutdown sequences
Pro: guarantee shutdown occurs even without re-syncing a snapshot to ztunnel; Kubernetes will indefinitely retry a failed pod sandbox
Cons: maybe indefinitely retrying something that is ~invisible is not a good thing, if we somehow end up in a state where the CNI plugin is deployed but we cannot connect to ztunnel?
Cons: conceptually simple, but plausibly triggers a variety of strange platform interactions we didn't think of

(3) Workaround on client side: when we see a pod is deleted (via WDS), remove any pooled connections to it

Pros: effectively allows us to drop pooled connections.
Cons: hard to implement. We need a watch on the object, tying the pool to XDS. Additionally, we need some way to pipe this message through to the pool which is also tricky.
Cons: only helps ztunnel clients.
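The pool-eviction idea in option (3) can be sketched roughly like this; the `Pool` shape and keying are made up for illustration (the real HBONE pool is keyed and plumbed differently, which is exactly the "hard to implement" part):

```rust
use std::collections::HashMap;

/// Toy HBONE connection pool keyed by destination workload
/// (illustrative; the real pool keying is more involved).
struct Pool {
    conns: HashMap<String, Vec<u64>>, // workload -> pooled connection ids
}

impl Pool {
    fn new() -> Self {
        Pool { conns: HashMap::new() }
    }

    fn add(&mut self, workload: &str, conn_id: u64) {
        self.conns.entry(workload.to_string()).or_default().push(conn_id);
    }

    /// Called when WDS reports the workload was deleted: drop all
    /// pooled connections to it so we don't hold stale conns.
    /// Returns how many connections were evicted.
    fn evict_workload(&mut self, workload: &str) -> usize {
        self.conns.remove(workload).map(|v| v.len()).unwrap_or(0)
    }
}
```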

(4) Do nothing. Clients will eventually have keepalives timeout and drop the HBONE connections.

Pros: simple
Cons: we hold onto stale connections until the keepalive timeout. During this time, we get a bunch of ping attempts.
Cons: we either log these timeouts as errors (noisy/confusing logs on every pod deletion) or hide all ping timeout messages (maybe masking legitimate issues)
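Keepalive-based staleness detection amounts to tracking the last PING ack on each pooled connection. A minimal sketch; the real keepalives live in the HTTP/2 layer, and the timeout value is arbitrary:

```rust
use std::time::{Duration, Instant};

/// Tracks the last successful HTTP/2 PING ack on a pooled connection
/// (illustrative; ztunnel's actual keepalives are handled lower down).
struct Keepalive {
    last_ack: Instant,
    timeout: Duration,
}

impl Keepalive {
    fn new(timeout: Duration) -> Self {
        Keepalive { last_ack: Instant::now(), timeout }
    }

    /// Record a PING ack from the peer.
    fn on_ping_ack(&mut self) {
        self.last_ack = Instant::now();
    }

    /// True once the peer has gone quiet for longer than the timeout;
    /// the connection should then be dropped. Per the cons above, how
    /// loudly this is logged matters: every pod deletion would trip it.
    fn is_stale(&self, now: Instant) -> bool {
        now.duration_since(self.last_ack) > self.timeout
    }
}
```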

Yeah - in general I think the following are true:

  • Graceful drain is only useful for HBONE connections between ztunnels, because that's where taking the time to GOAWAY is helpful.

  • There are 3 different "shutdown" scenarios ->

    • "ztunnel" is shutting down -> (all workload proxies must drain, and K8S will eventually kill us if we take too long)
    • "ztunnel" is NOT shutting down, but the workload X is -> (workload X proxy must drain, we have until the pod netns goes away)
    • "ztunnel" is NOT shutting down, and workload X is NOT shutting down, but is being unhooked -> (workload X proxy must drain, the netns isn't going anywhere)

I think the latter two scenarios are ~roughly the same thing, and probably require some kind of ACK to the node agent, so it can know when the entire "remove pod" flow is actually done. Otherwise I think the state might get a bit ambiguous? Probably most of this will be answered in impl tho.

I have a slight pref for (1) I think. I am wary of getting too hooked into the CNI plugin lifecycle, and have some preference for being opportunistic with pod cleanup/shutdown rather than trying to manage it or stall it, since at the end of the day we don't control that, and it's not our domain.

Cons: maybe indefinitely retrying something that is ~invisible is not a good thing, if we somehow end up in a state where the CNI plugin is deployed but we cannot connect to ztunnel?

Yeah, I think this is the kind of thing that gives me pause. It's harder to explain to operators, it's not terribly intuitive, and corner cases are hard to reason about.

  • What does GOAWAYing ztunnel<->ztunnel HBONE connections actually give us? It doesn't help the client apps/user workloads very much, they aren't aware they're being proxied and already have made the connection, and so are unavoidably going to get some unsignaled interruption. What's the worst-case behavior without it, for a HBONE connection?

Note that for (2) we could also maybe use GC: https://github.com/containernetworking/cni/blob/main/SPEC.md#gc-clean-up-any-stale-resources

I am not sure GC works since we need to do our cleanup before the rest of the CNI does (else we cannot send GOAWAY since the veth is gone).

What does GOAWAYing ztunnel<->ztunnel HBONE connections actually give us? It doesn't help the client apps very much, they aren't aware they're being proxied and already have made the connection. What's the worst-case behavior without it, for a HBONE connection?

Its not about the client apps. If we don't goaway, we have a few issues:

  • Clients will retain pooled connections to us. We can keepalive these to close them out, but that is not ideal (see above commentary)
  • (minor) Not sending close_alert on TLS connections is not recommended and makes rustls spit out errors about not shutting down properly

Can we safely factor out the use of keepalives for HBONE connections entirely if we do this, or would we still need them as a backup/failsafe to avoid keeping around stale conns if there are unexpected disruptions (I think this is the current state)?

If we can safely in all cases drop keepalives if we do this, it's worth doing. If we still need keepalives as a Plan B no matter what, then to me it feels less so.

I think keepalives are a good general practice. There is no guarantee a TCP connection stays live -- there can ALWAYS be something that non-gracefully terminates. But they also shouldn't be used as the primary way to close a connection

I guess DEL runs in reverse order? and we are always the last plugin

Yeah that's part of the spec so we can rely on that.

I think keepalives are a good general practice. There is no guarantee a TCP connection stays live -- there can ALWAYS be something that non-gracefully terminates. But they also shouldn't be used as the primary way to close a connection

Conceptually and hygienically I agree with that, but in this specific case where the thing we care about protecting (the workload/client app) is already isolated from this, I don't know if practically speaking it makes a difference unless there are things we (that is, ztunnel) need to clean up that we can't clean up in other ways.

I don't think we should let these decisions tie to whether we use keepalives. They are either useful or not to detect non-graceful broken connections, and that doesn't really change just because we remove one way a connection can be broken non-gracefully; there are inherently always going to be these. I am not an expert here but it seems like it's pretty low-risk, medium-reward to have, so worth keeping

One thing I am concerned with (1) is, consider a case where I have pod-a sending traffic to pod-b, even after termination has started:

With (1), once pod-b starts to terminate, we are going to open up new HBONE connections for each new app connection between pod-a and pod-b. The end result, from users POV, is increased CPU/latency when the pod is shutting down.

We do similar in sidecars, but it has a purpose: it tells the application to retry to another backend that is not shutting down.

There is really no backpressure mechanism in hbone. The only thing we need to do is tell the client to drop their pooled connections gracefully. That is better done at the last minute, which (2) can provide

One thing I am concerned with (1) is, consider a case where I have pod-a sending traffic to pod-b, even after termination has started:

With (1), once pod-b starts to terminate, we are going to open up new HBONE connections for each new app connection between pod-a and pod-b. The end result, from users POV, is increased CPU/latency when the pod is shutting down.

Right - in both cases, all the connections the user app (pod-a) is making will fail (from their perspective) - we just won't have the signaling in place to immediately prevent pod-a from making conns by rejecting them at the source (pod-a's ztunnel). It will be a bit delayed from the perspective of pod-a, and the pod-a ztunnel might use a smidge more resources trying to establish pooled connections that will inevitably fail.

Do we know that that's significant? It's probably worse for same-node. This is also something we could likely entirely mitigate practically with a relatively simple clientside backoff in the HBONE pool, which wouldn't be the worst idea to have regardless (I guess that's option 3)

We do similar in sidecars, but it has a purpose: it tells the application to retry to another backend that is not shutting down.

In theory, this could still be the case tho here? Even in (1) we are backpressuring to the app, ultimately, which may have its own retry logic, which may connect to pod-c which isn't shutting down.

The only benefit of (2) that I can see is we bubble up the backpressure to the app a bit faster, but again an HBONE pool clientside backoff would probably solve this effectively as well, and would be a good idea generally.
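A clientside pool backoff of the kind suggested here could be as simple as exponential delay on consecutive connect failures. A sketch; the base/max parameters are placeholders, not anything ztunnel ships:

```rust
use std::time::Duration;

/// Exponential backoff for HBONE connect attempts after consecutive
/// failures (illustrative parameters).
struct Backoff {
    failures: u32,
    base: Duration,
    max: Duration,
}

impl Backoff {
    fn new() -> Self {
        Backoff {
            failures: 0,
            base: Duration::from_millis(100),
            max: Duration::from_secs(5),
        }
    }

    /// Record a failed connect attempt and return how long to wait
    /// before the next one: base * 2^(failures-1), capped at max.
    fn on_failure(&mut self) -> Duration {
        self.failures += 1;
        let d = self.base * 2u32.saturating_pow(self.failures - 1);
        d.min(self.max)
    }

    /// A successful connect resets the backoff.
    fn on_success(&mut self) {
        self.failures = 0;
    }
}
```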

Right - in both cases, all the connections the user app (pod-a) is making will fail (from their perspective)

No, that is not the case. Maybe I am not explaining the scenario properly.

Pod B starts terminating, but is not terminated yet. I.e. deletionTimestamp is non-null, but pod phase is running.

During this time, per our rules, Pod B MUST continue to accept traffic.

In both cases, the traffic will succeed. In (1), each connection will open a new HBONE connection. In (2), we will pool like normal.

Maybe I have recency bias after exploring istio/istio#51855, where similar logic caused the app to disable pooling which caused effectively a DOS. (note: issue is not about ambient, just same idea).

Pod B starts terminating, but is not terminated yet. I.e. deletionTimestamp is non-null, but pod phase is running.

During this time, per our rules, Pod B MUST continue to accept traffic.

In both cases, the traffic will succeed. In (1), each connection will open a new HBONE connection. In (2), we will pool like normal.

Maybe I have recency bias after exploring istio/istio#51855, where similar logic caused the app to disable pooling which caused effectively a DOS. (note: issue is not about ambient, just same idea).

Ah ok. It's a bit of a chicken-or-egg problem. And yeah - we probably should just take no action at all on "TERM_START". We have to proxy traffic until the last possible second, or we will inevitably negatively interact with $WHATEVER_POOLING_OR_SHUTDOWN_LOGIC a given client/server want to perform.

A combination of 3 and 4 is probably still the simplest and least likely to interfere there - keepalives (which we already have) + clientside pool backoff (which we arguably want anyway, since we could still have the same kind of backpressure problem if the dest ztunnel is down but the dest apps aren't) on consecutive failures. We proxy everything we can until the last possible moment (when the dest process terminates), and the instant we can't, we bubble the backpressure up to the client app (which does whatever it wants).

One thing to clarify is GOAWAY doesn't close the inner connections. It just means once there are no more active inner connections, the outer connection will be dropped instead of saved for pooling.

So it is viable to send GOAWAY before the last second, it just may not be ideal.
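Those GOAWAY semantics can be modeled in a few lines: active inner streams keep running, new streams are refused, and the outer connection is dropped once drained. This is a toy model for illustration, not ztunnel's pool implementation:

```rust
/// Toy model of HTTP/2 GOAWAY semantics on a pooled HBONE connection.
struct HboneConn {
    active_streams: u32,
    goaway_received: bool,
}

impl HboneConn {
    fn new() -> Self {
        HboneConn { active_streams: 0, goaway_received: false }
    }

    /// Returns false if the stream is refused because a GOAWAY was
    /// already seen; existing streams are unaffected.
    fn open_stream(&mut self) -> bool {
        if self.goaway_received {
            return false;
        }
        self.active_streams += 1;
        true
    }

    fn on_goaway(&mut self) {
        self.goaway_received = true;
    }

    fn close_stream(&mut self) {
        self.active_streams -= 1;
    }

    /// After GOAWAY, drop the outer connection instead of returning it
    /// to the pool once the last inner stream finishes.
    fn should_close(&self) -> bool {
        self.goaway_received && self.active_streams == 0
    }
}
```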