postgresml/pgcat

Timeouts on slow queries cause a server to get banned

Closed this issue · 1 comments

Is your feature request related to a problem? Please describe.
I would like my server to not get banned if there's a statement timeout due to a table lock or a slow query.

Describe the solution you'd like
I would like server health to be determined by a health check query (select 1) instead of the client's query, which would mean the server would not get banned just because someone tried to run a slow query. This behaviour could be behind a configuration flag.

Describe alternatives you've considered

Additional context

Logs:

2023-10-16T08:54:55.172159Z  INFO ThreadId(06) pgcat::client: Client 172.19.15.235:41922 connected (TLS)
2023-10-16T08:55:00.229157Z ERROR ThreadId(06) pgcat::client: Statement timeout while talking to Address { id: 1, host: "pg-main-read-replica-2", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 1, address_index: 1, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 38, query_count: 127, bytes_received: 7594, bytes_sent: 4798, xact_time: 0, query_time: 49, wait_time: 19913, errors: 0 }, current: AddressStatFields { xact_count: 1, query_count: 7, bytes_received: 503, bytes_sent: 249, xact_time: 0, query_time: 0, wait_time: 1282, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages_updated: false }, error_count: 0 } with user money_srv
2023-10-16T08:55:00.229215Z ERROR ThreadId(06) pgcat::server: Server Address { id: 1, host: "pg-main-read-replica-2", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 1, address_index: 1, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 38, query_count: 127, bytes_received: 7594, bytes_sent: 4798, xact_time: 0, query_time: 49, wait_time: 19913, errors: 0 }, current: AddressStatFields { xact_count: 1, query_count: 7, bytes_received: 503, bytes_sent: 249, xact_time: 0, query_time: 0, wait_time: 1282, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages_updated: false }, error_count: 0 } marked bad
2023-10-16T08:55:00.229225Z ERROR ThreadId(06) pgcat::pool: Banning instance Address { id: 1, host: "pg-main-read-replica-2", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 1, address_index: 1, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 38, query_count: 127, bytes_received: 7594, bytes_sent: 4798, xact_time: 0, query_time: 49, wait_time: 19913, errors: 0 }, current: AddressStatFields { xact_count: 1, query_count: 7, bytes_received: 503, bytes_sent: 249, xact_time: 0, query_time: 0, wait_time: 1282, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages_updated: false }, error_count: 0 }, reason: StatementTimeout
2023-10-16T08:55:00.229495Z  INFO ThreadId(06) pgcat::server: Server connection terminated Address { id: 1, host: "pg-main-read-replica-2", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 1, address_index: 1, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 38, query_count: 127, bytes_received: 7594, bytes_sent: 4798, xact_time: 0, query_time: 49, wait_time: 19913, errors: 1 }, current: AddressStatFields { xact_count: 1, query_count: 7, bytes_received: 503, bytes_sent: 249, xact_time: 0, query_time: 0, wait_time: 1282, errors: 1 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages_updated: false }, error_count: 0 }, session duration: 0d 00:03:23.542
2023-10-16T08:55:00.229735Z  WARN ThreadId(06) pgcat: Client disconnected with error StatementTimeout
2023-10-16T08:55:00.793977Z  INFO ThreadId(06) pgcat::client: Client 172.19.0.16:51980 connected (TLS)
2023-10-16T08:55:05.837082Z ERROR ThreadId(04) pgcat::client: Statement timeout while talking to Address { id: 2, host: "pg-hydra-test.test.svc.cluster.local", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 2, address_index: 2, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 27, query_count: 83, bytes_received: 4260, bytes_sent: 3503, xact_time: 0, query_time: 31, wait_time: 60566, errors: 1 }, current: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 11, bytes_sent: 15, xact_time: 0, query_time: 0, wait_time: 77, errors: 0 }, averages_updated: false }, error_count: 0 } with user money_srv
2023-10-16T08:55:05.837133Z ERROR ThreadId(04) pgcat::server: Server Address { id: 2, host: "pg-hydra-test.test.svc.cluster.local", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 2, address_index: 2, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 27, query_count: 83, bytes_received: 4260, bytes_sent: 3503, xact_time: 0, query_time: 31, wait_time: 60566, errors: 1 }, current: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 11, bytes_sent: 15, xact_time: 0, query_time: 0, wait_time: 77, errors: 0 }, averages_updated: false }, error_count: 0 } marked bad  
2023-10-16T08:55:05.837143Z ERROR ThreadId(04) pgcat::pool: Banning instance Address { id: 2, host: "pg-hydra-test.test.svc.cluster.local", port: 5432, shard: 0, database: "db_name", role: Replica, replica_number: 2, address_index: 2, username: "money_srv", pool_name: "db_name", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 27, query_count: 83, bytes_received: 4260, bytes_sent: 3503, xact_time: 0, query_time: 31, wait_time: 60566, errors: 1 }, current: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 11, bytes_sent: 15, xact_time: 0, query_time: 0, wait_time: 77, errors: 0 }, averages_updated: false }, error_count: 0 }, reason: StatementTimeout

I've realised that this only happens when pgcat's own statement_timeout is reached, but not the one set in PSQL, which is the one I was worried about