TcpDnsClient cannot recover if registration on TcpConnection times out
nvollmar opened this issue · 5 comments
I uncovered this while investigating cluster issues in our nightly deployment test. Since we started using a low-power CPU governor during the night, we have been seeing the Pekko cluster fail to form during the nightly deployment.
I've tracked it down to the `TcpDnsClient`/`TcpConnection` initialization timing out, which leaves the client in a state it cannot recover from, never responding to any requests.
The `TcpOutgoingConnection` connects and responds with a `Tcp.Connected` message to the `TcpDnsClient`, which in turn registers itself on the connection:
https://github.com/apache/incubator-pekko/blob/46e60a61fbabce5e3f36a408bfa3d1fb249eef44/actor/src/main/scala/org/apache/pekko/io/dns/internal/TcpDnsClient.scala#L48-L53
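For context, the flow is roughly the standard Pekko IO TCP registration handshake. A minimal sketch of that pattern (simplified and paraphrased, not the actual `TcpDnsClient` source; the actor and names are made up):

```scala
import java.net.InetSocketAddress
import org.apache.pekko.actor.{ Actor, ActorRef }
import org.apache.pekko.io.{ IO, Tcp }

// Sketch of the handshake: the client asks the TCP manager to connect,
// receives Tcp.Connected from the spawned TcpOutgoingConnection, and must
// answer with Tcp.Register before that connection's registration timeout fires.
class DnsClientSketch(ns: InetSocketAddress) extends Actor {
  import context.system // implicit ActorSystem for the IO extension
  IO(Tcp) ! Tcp.Connect(ns)

  def receive: Receive = {
    case Tcp.Connected(_, _) =>
      val connection = sender()       // the TcpOutgoingConnection actor
      connection ! Tcp.Register(self) // must arrive before the timeout
      context.become(connected(connection))
  }

  def connected(connection: ActorRef): Receive = {
    case Tcp.Received(data) => () // parse DNS response bytes here
  }
}
```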
If that registration message arrives late, the `TcpOutgoingConnection` stops itself, and `TcpDnsClient` has no detection or handling for this case.
This is a very unusual case, but it happens on almost every deployment for one or two pods while the system is in low-power mode.
Proposed fix: `TcpDnsClient` must watch the connection and fail on termination so it can re-initialize (the client is already handled by a backoff supervisor). A sketch of what I mean follows below.
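A minimal sketch of the idea (illustrative only, not the final patch; the actor name and exception are made up):

```scala
import org.apache.pekko.actor.{ Actor, ActorRef, Terminated }
import org.apache.pekko.io.Tcp

// Watch the connection actor as soon as it is known, and escalate when it
// terminates so the surrounding backoff supervisor restarts the client.
class WatchingDnsClientSketch extends Actor {
  def receive: Receive = {
    case Tcp.Connected(_, _) =>
      val connection = sender()
      context.watch(connection) // deliver Terminated if the connection stops itself
      connection ! Tcp.Register(self)
      context.become(connected(connection))
  }

  def connected(connection: ActorRef): Receive = {
    case Terminated(`connection`) =>
      // Registration timed out and TcpOutgoingConnection stopped itself;
      // fail so the backoff supervisor re-creates the client from scratch.
      throw new IllegalStateException("DNS connection terminated")
    case Tcp.Received(data) => () // normal DNS response handling
  }
}
```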
The registration message was sent to the connection immediately, but you observed that it arrived late.
I noticed that `TcpOutgoingConnection` can reply to `TcpDnsClient` after an exception (in `postStop`). Is it possible that the connection's termination response is not received by `TcpDnsClient` in time due to the lack of active scheduling on the low-power CPU?
The `TcpDnsClient` didn't receive anything, as the `TcpOutgoingConnection` just does `context.stop(self)` in case of a timeout. The client is then basically dead and can't recover without restarting the actor system.
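To illustrate why there is no signal at all: messages sent to a stopped actor are silently routed to `deadLetters`, so without a DeathWatch the client just waits forever. A tiny standalone demo of that behavior (plain Pekko actors, not the DNS code itself):

```scala
import org.apache.pekko.actor.{ Actor, ActorSystem, Props }

object DeadConnectionDemo extends App {
  class Stoppy extends Actor {
    def receive: Receive = { case "stop" => context.stop(self) }
  }

  val system = ActorSystem("demo")
  val ref    = system.actorOf(Props(new Stoppy))
  ref ! "stop"
  Thread.sleep(100)      // give the stop time to happen
  ref ! "are you there?" // logged as a dead letter; the sender gets nothing back
  system.terminate()
}
```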
@pjfanning Since we have run into this a couple of times now, I'd like to backport it to 1.0.
@nvollmar sure - could you create a cherry-pick PR that targets the 1.0.x branch and add that new PR to the 1.0.3 milestone?
Sure, will do