apache/pekko

TcpDnsClient cannot recover if registration on TcpConnection times out

nvollmar opened this issue · 5 comments

I uncovered this while investigating cluster issues in our nightly deployment test. Since we started using a low-power CPU governor during the night, we have been seeing issues with the Pekko cluster forming during the nightly deployment.

I've tracked it down to the TcpDnsClient / TcpConnection initialization timing out, leaving the client in a state it cannot recover from, never responding to any requests.

The TcpOutgoingConnection connects and responds with a Tcp.Connected message to the TcpDnsClient, which in turn registers itself on the connection:
https://github.com/apache/incubator-pekko/blob/46e60a61fbabce5e3f36a408bfa3d1fb249eef44/actor/src/main/scala/org/apache/pekko/io/dns/internal/TcpDnsClient.scala#L48-L53
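The handler at that link does roughly the following (paraphrased from the linked lines; details such as request buffering are omitted):

```scala
case connected @ Tcp.Connected(_, _) =>
  log.debug("Connected to TCP address [{}]", connected.remoteAddress)
  val connection = sender()
  context.become(ready(connection))
  // Register this actor as the handler for the new connection;
  // TcpOutgoingConnection waits for this Register, with a timeout.
  connection ! Tcp.Register(self)
```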

If that Tcp.Register message arrives late, the TcpOutgoingConnection stops itself, and TcpDnsClient has no detection or handling for this case:

https://github.com/apache/incubator-pekko/blob/46e60a61fbabce5e3f36a408bfa3d1fb249eef44/actor/src/main/scala/org/apache/pekko/io/TcpConnection.scala#L104-L108
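The timeout branch in question looks roughly like this (paraphrased from the linked lines):

```scala
case ReceiveTimeout =>
  // After sending `Register` the user should watch this actor to make
  // sure it didn't die because of the timeout.
  log.debug("Configured registration timeout of [{}] expired, stopping", RegisterTimeout)
  context.stop(self)
```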

This is a very unusual case, but it happens on almost every deployment for one or two pods when the system is in low-power mode.

Proposed fix: TcpDnsClient should watch the connection and fail on its termination in order to re-initialize (the client is already wrapped in a backoff supervisor).
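A minimal sketch of that idea, using a hypothetical standalone client actor (this is not the actual patch; the real TcpDnsClient also buffers and frames DNS messages):

```scala
import java.net.InetSocketAddress

import org.apache.pekko.actor.{ Actor, ActorLogging, ActorRef, Terminated }
import org.apache.pekko.io.{ IO, Tcp }

// Hypothetical sketch: watch the connection actor so that a registration
// timeout (context.stop(self) inside TcpOutgoingConnection) surfaces as a
// Terminated message, which we turn into a failure for the backoff supervisor.
class WatchingDnsClient(ns: InetSocketAddress) extends Actor with ActorLogging {
  import context.system

  override def preStart(): Unit =
    IO(Tcp) ! Tcp.Connect(ns)

  override def receive: Receive = {
    case connected: Tcp.Connected =>
      log.debug("Connected to [{}]", connected.remoteAddress)
      val connection = sender()
      context.watch(connection) // the proposed addition
      connection ! Tcp.Register(self)
      context.become(ready(connection))

    case Tcp.CommandFailed(_: Tcp.Connect) =>
      throw new RuntimeException(s"Failed to connect to $ns")
  }

  def ready(connection: ActorRef): Receive = {
    case Terminated(`connection`) =>
      // Registration timed out and the connection stopped itself; fail so
      // the backoff supervisor restarts this client with a fresh connection.
      throw new RuntimeException(s"Connection to $ns terminated unexpectedly")

    case Tcp.Received(data) =>
      log.debug("Received [{}] bytes", data.length) // real client parses DNS answers here
  }
}
```

Without the context.watch call, the stopped connection never notifies the client, which is exactly the stuck state described above.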

So the Tcp.Register message was sent to the connection immediately, but you observed that it arrived late.

I noticed that TcpOutgoingConnection can reply to TcpDnsClient after an exception (in postStop). Is it possible that the connection's termination notification is not received by TcpDnsClient in time due to the reduced scheduling of the low-power CPU?

The TcpDnsClient didn't receive anything, as TcpOutgoingConnection just calls context.stop(self) in case of a timeout. The client is then basically dead and can't recover without restarting the actor system.

@pjfanning Since we ran into this a couple of times now, I'd like to backport the fix to 1.0

@nvollmar sure - could you create a cherry-pick PR that targets the 1.0.x branch and add that new PR to the 1.0.3 milestone?

Sure, will do