postgresml/pgcat

Intermittent SocketError("Error reading message code from socket - Error Kind(UnexpectedEof)") when accessing pgcat through AWS NLB in EKS cluster

Closed · 2 comments

Cluas commented

Description:

I have deployed pgcat in an EKS cluster, and the client runs in the same EKS cluster. When the client pod accesses pgcat through an AWS NLB, the pgcat logs frequently, though intermittently, show the following error:

[2023-04-11T07:15:39.839836Z WARN pgcat] Client disconnected with error SocketError("Error reading message code from socket - Error Kind(UnexpectedEof)")

Here's an example of the log entries:

2023-04-11 15:15:39
[2023-04-11T07:15:39.839836Z WARN  pgcat] Client disconnected with error SocketError("Error reading message code from socket - Error Kind(UnexpectedEof)")
...
2023-04-11 14:55:07
[2023-04-11T06:55:07.206480Z WARN  pgcat] Client disconnected with error SocketError("Error reading message code from socket - Error Kind(UnexpectedEof)")

Environment:

  • pgcat v1
  • pgcat deployed on EKS
  • Client pod deployed on the same EKS cluster
  • pgcat accessed via AWS NLB

I am looking for a solution to resolve these intermittent errors. Any help or guidance would be appreciated.

levkk commented

It's the NLB health check opening and closing a TCP connection. We should add an option to ignore these and not log anything (PgBouncer has a similar option).
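
To illustrate the mechanism described above, here is a minimal sketch in Python (not pgcat's actual Rust code). It assumes a server that tries to read a one-byte message code from a freshly accepted connection, while a probe connects and closes without sending anything, as a TCP health check would. The names `read_message_code`, `probe_and_close`, and `demo` are hypothetical and exist only for this example.

```python
import socket
import threading


def read_message_code(conn: socket.socket) -> bytes:
    """Read a one-byte message code; raise EOFError if the peer closed first."""
    data = conn.recv(1)
    if not data:
        # recv() returning b"" means the peer closed the connection cleanly.
        raise EOFError("unexpected EOF while reading message code")
    return data


def probe_and_close(address) -> None:
    """Mimic a TCP health check: connect, send nothing, close."""
    with socket.create_connection(address, timeout=5):
        pass


def demo() -> str:
    """Return "eof" if the server side saw the connection close before any data."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        prober = threading.Thread(
            target=probe_and_close, args=(listener.getsockname(),)
        )
        prober.start()
        try:
            conn, _ = listener.accept()
            with conn:
                conn.settimeout(5)  # avoid hanging if the probe misbehaves
                try:
                    read_message_code(conn)
                    return "message"
                except EOFError:
                    return "eof"
        finally:
            prober.join()


if __name__ == "__main__":
    print(demo())
```

A server that wants to stay quiet about such probes could treat an EOF that arrives before any bytes were read as a non-event rather than a warning; whether that is appropriate depends on the deployment.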

@levkk This issue seems to be caused by the NLB resetting the connection because no data passed through it for longer than the NLB idle timeout.
See the details at https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

For each TCP request that a client makes through a Network Load Balancer, the state of that connection is tracked. If no data is sent through the connection by either the client or target for longer than the idle timeout, the connection is closed. If a client or target sends data after the idle timeout period elapses, it receives a TCP RST packet to indicate that the connection is no longer valid.

We set the idle timeout value for TCP flows to 350 seconds. You can't modify this value. Clients or targets can use TCP keepalive packets to reset the idle timeout. Keepalive packets sent to maintain TLS connections can't contain data or payload.

As recommended, TCP keepalive should be enabled on the pgcat listener to keep long-lived connections stable.
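
As a rough illustration of the keepalive recommendation, the sketch below shows in Python (not pgcat's own code or configuration) how keepalive can be enabled on a TCP socket so that probes start well before the 350-second NLB idle timeout. The function name `enable_keepalive` and the default values (60 s idle, 10 s interval, 3 probes) are assumptions chosen for the example; the per-socket tuning options are platform-specific and are applied only where the platform exposes them.

```python
import socket

# Idle timeout for TCP flows quoted from the AWS NLB documentation above.
NLB_IDLE_TIMEOUT_S = 350


def enable_keepalive(
    sock: socket.socket,
    idle_s: int = 60,      # assumed value: idle seconds before the first probe
    interval_s: int = 10,  # assumed value: seconds between probes
    probes: int = 3,       # assumed value: failed probes before giving up
) -> None:
    """Turn on TCP keepalive so probes are sent before the NLB idle timeout."""
    if idle_s >= NLB_IDLE_TIMEOUT_S:
        raise ValueError("keepalive idle time must be below the NLB idle timeout")

    # Portable switch that enables keepalive probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning; these constants exist on Linux and some other
    # platforms, so each one is applied only if it is available.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

With keepalive enabled on either the client or the target side, an otherwise idle connection still carries periodic probe packets, which per the quoted AWS text reset the idle timer.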