EnterpriseDB/repmgr

witness connection leaks

Closed this issue · 2 comments

junfx commented
$ repmgr --version
repmgr 5.2.1

$ postgres --version
postgres (PostgreSQL) 13.3

reproduce steps:

  1. memory on primary(ip102) is exhausted
  2. standby(ip103) protote itself. There are two primaries in the cluster now. Witness follows the new primary (ip103)
  3. try make ip103 standby: stop postgres, standby clone, start postgres, standby register -F
  4. witness(ip101) tries to reconnect ip103 (but ip103 is standby again), until no more connection/memory available on ip103

Log on witness:

...
[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]  
connection pointer is NULL

[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]  
connection pointer is NULL

[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]  
connection pointer is NULL

[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]  
connection pointer is NULL

[2021-12-21 20:10:14] [NOTICE] setting primary_node_id to last known ID 2
[2021-12-21 20:10:14] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2021-12-21 20:10:14] [NOTICE] reconnected to upstream node "172.16.56.103-defaultcluster" (ID: 2) after 0 seconds, resuming monitoring
[2021-12-21 20:10:14] [NOTICE] current upstream node "172.16.56.103-defaultcluster" (ID: 2) is not primary, restarting monitoring
[2021-12-21 20:10:14] [WARNING] unable to connect to primary
[2021-12-21 20:10:14] [DETAIL]  
connection pointer is NULL
...

netstat -anp | grep repmgr on witness:


...
tcp        0      0 172.16.56.103:44180     172.16.56.102:5432      TIME_WAIT   -                   
tcp        0      0 172.16.56.103:5432      172.16.56.101:48414     ESTABLISHED 32510/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47164     ESTABLISHED 31879/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48664     ESTABLISHED 32639/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47964     ESTABLISHED 32282/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47984     ESTABLISHED 32292/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48194     ESTABLISHED 32399/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48584     ESTABLISHED 32598/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47904     ESTABLISHED 32252/lightdb: repm 
tcp        0      0 172.16.56.103:44126     172.16.56.102:5432      TIME_WAIT   -                   
tcp        0      0 172.16.56.103:5432      172.16.56.101:48364     ESTABLISHED 32485/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47684     ESTABLISHED 32141/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47704     ESTABLISHED 32151/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47274     ESTABLISHED 31934/lightdb: repm 
tcp        0      0 172.16.56.103:48182     172.16.56.102:5432      ESTABLISHED 26999/repmgrd       
tcp        0      0 172.16.56.103:5432      172.16.56.101:48284     ESTABLISHED 32445/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48244     ESTABLISHED 32424/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48224     ESTABLISHED 32414/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48524     ESTABLISHED 32568/lightdb: repm 
tcp        0      0 172.16.56.103:44260     172.16.56.102:5432      TIME_WAIT   -                   
tcp        0      0 172.16.56.103:5432      172.16.56.101:48684     ESTABLISHED 32649/lightdb: repm 
tcp        0      0 172.16.56.103:44340     172.16.56.102:5432      TIME_WAIT   -                   
tcp        0      0 172.16.56.103:5432      172.16.56.101:48424     ESTABLISHED 32515/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47774     ESTABLISHED 32187/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48064     ESTABLISHED 32333/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47504     ESTABLISHED 32050/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:48724     ESTABLISHED 32669/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47214     ESTABLISHED 31904/lightdb: repm 
tcp        0      0 172.16.56.103:5432      172.16.56.101:47794     ESTABLISHED 32197/lightdb: repm 
tcp        0      0 172.16.56.103:44096     172.16.56.102:5432      TIME_WAIT   -                   
tcp        0      0 172.16.56.103:5432      172.16.56.101:47914     ESTABLISHED 32257/lightdb: repm 
...
junfx commented

After some debuggging, I guess primary_conn established by check_upstream_connection is overwriten by primary_conn = establish_db_connection at repmgrd-physical.c:2463 (repmgr-5.2.1):

if (monitoring_state == MS_DEGRADED)
{
	int			degraded_monitoring_elapsed = calculate_elapsed(degraded_monitoring_start);

	log_debug("monitoring node %i in degraded state for %i seconds",
			  upstream_node_info.node_id,
			  degraded_monitoring_elapsed);

	if (check_upstream_connection(&primary_conn, upstream_node_info.conninfo, NULL) == true)
	{
		if (config_file_options.connection_check_type != CHECK_QUERY)
			primary_conn = establish_db_connection(upstream_node_info.conninfo, false);

		if (PQstatus(primary_conn) == CONNECTION_OK)
		{
			PQExpBufferData event_details;

			upstream_node_info.node_status = NODE_STATUS_UP;

Thanks for the report; this has been fixed via commit 2dce8e1 (PR #732).