witness connection leaks
Closed this issue · 2 comments
junfx commented
$ repmgr --version
repmgr 5.2.1
$ postgres --version
postgres (PostgreSQL) 13.3
reproduce steps:
- memory on primary(ip
102
) is exhausted - standby(ip
103
) protote itself. There are two primaries in the cluster now. Witness follows the new primary (ip103
) - try make ip
103
standby: stop postgres, standby clone, start postgres, standby register -F - witness(ip
101
) tries to reconnect ip103 (but ip103 is standby again), until no more connection/memory available on ip103
Log on witness:
...
[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]
connection pointer is NULL
[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]
connection pointer is NULL
[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]
connection pointer is NULL
[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]
connection pointer is NULL
[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]
connection pointer is NULL
...
netstat -anp | grep repmgr
on witness:
...
tcp 0 0 172.16.56.103:44180 172.16.56.102:5432 TIME_WAIT -
tcp 0 0 172.16.56.103:5432 172.16.56.101:48414 ESTABLISHED 32510/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47164 ESTABLISHED 31879/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48664 ESTABLISHED 32639/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47964 ESTABLISHED 32282/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47984 ESTABLISHED 32292/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48194 ESTABLISHED 32399/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48584 ESTABLISHED 32598/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47904 ESTABLISHED 32252/lightdb: repm
tcp 0 0 172.16.56.103:44126 172.16.56.102:5432 TIME_WAIT -
tcp 0 0 172.16.56.103:5432 172.16.56.101:48364 ESTABLISHED 32485/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47684 ESTABLISHED 32141/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47704 ESTABLISHED 32151/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47274 ESTABLISHED 31934/lightdb: repm
tcp 0 0 172.16.56.103:48182 172.16.56.102:5432 ESTABLISHED 26999/repmgrd
tcp 0 0 172.16.56.103:5432 172.16.56.101:48284 ESTABLISHED 32445/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48244 ESTABLISHED 32424/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48224 ESTABLISHED 32414/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48524 ESTABLISHED 32568/lightdb: repm
tcp 0 0 172.16.56.103:44260 172.16.56.102:5432 TIME_WAIT -
tcp 0 0 172.16.56.103:5432 172.16.56.101:48684 ESTABLISHED 32649/lightdb: repm
tcp 0 0 172.16.56.103:44340 172.16.56.102:5432 TIME_WAIT -
tcp 0 0 172.16.56.103:5432 172.16.56.101:48424 ESTABLISHED 32515/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47774 ESTABLISHED 32187/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48064 ESTABLISHED 32333/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47504 ESTABLISHED 32050/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:48724 ESTABLISHED 32669/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47214 ESTABLISHED 31904/lightdb: repm
tcp 0 0 172.16.56.103:5432 172.16.56.101:47794 ESTABLISHED 32197/lightdb: repm
tcp 0 0 172.16.56.103:44096 172.16.56.102:5432 TIME_WAIT -
tcp 0 0 172.16.56.103:5432 172.16.56.101:47914 ESTABLISHED 32257/lightdb: repm
...
junfx commented
After some debuggging, I guess primary_conn
established by check_upstream_connection
is overwriten by primary_conn = establish_db_connection
at repmgrd-physical.c:2463 (repmgr-5.2.1):
if (monitoring_state == MS_DEGRADED)
{
int degraded_monitoring_elapsed = calculate_elapsed(degraded_monitoring_start);
log_debug("monitoring node %i in degraded state for %i seconds",
upstream_node_info.node_id,
degraded_monitoring_elapsed);
if (check_upstream_connection(&primary_conn, upstream_node_info.conninfo, NULL) == true)
{
if (config_file_options.connection_check_type != CHECK_QUERY)
primary_conn = establish_db_connection(upstream_node_info.conninfo, false);
if (PQstatus(primary_conn) == CONNECTION_OK)
{
PQExpBufferData event_details;
upstream_node_info.node_status = NODE_STATUS_UP;